JC13 Rec'd PCT/PTO 27 DEC 2001



FORM PTO-1390 U.S. DEPARTMENT OF COMMERCE PATENT AND TRADEMARK OFFICE

(Rev. 9-2001) 

TRANSMITTAL LETTER TO THE UNITED STATES 
DESIGNATED/ELECTED OFFICE (DO/EO/US) 
CONCERNING A FILING UNDER 35 U.S.C. 371



ATTORNEY'S DOCKET NUMBER 

006838-079 



U.S. APPLICATION NO. (If known, see 37 C.F.R. 1.5)

10/019059



Unassigned



INTERNATIONAL APPLICATION NO. 
PCT/IE00/00083



INTERNATIONAL FILING DATE 

28 June 2000 



PRIORITY DATE CLAIMED 
28 June 1999 



TITLE OF INVENTION 

LOGIC EVENT SIMULATION 



APPLICANT(S) FOR DO/EO/US 
Damian DALTON 



Applicant herewith submits to the United States Designated/Elected Office (DO/EO/US) the following items and other information: 

1. [X] This is a FIRST submission of items concerning a filing under 35 U.S.C. 371.

2. [ ] This is a SECOND or SUBSEQUENT submission of items concerning a filing under 35 U.S.C. 371.

3. [X] This is an express request to begin national examination procedures (35 U.S.C. 371(f)). The submission must include items (5), (6),
(9) and (21) indicated below.

4. [X] The US has been elected by the expiration of 19 months from the priority date (Article 31).

5. [X] A copy of the International Application as filed (35 U.S.C. 371(c)(2))

a. [ ] is attached hereto (required only if not communicated by the International Bureau).
b. [X] has been communicated by the International Bureau.
c. [ ] is not required, as the application was filed in the United States Receiving Office (RO/US).

6. [ ] An English language translation of the International Application as filed (35 U.S.C. 371(c)(2))

a. [ ] is attached hereto.
b. [ ] has been previously submitted under 35 U.S.C. 154(d)(4).

7. [X] Amendments to the claims of the International Application under PCT Article 19 (35 U.S.C. 371(c)(3))

a. [ ] are attached hereto (required only if not communicated by the International Bureau).
b. [ ] have been communicated by the International Bureau.
c. [ ] have not been made; however, the time limit for making such amendments has NOT expired.
d. [X] have not been made and will not be made.

8. [ ] An English language translation of the amendments to the claims under PCT Article 19 (35 U.S.C. 371(c)(3)).

9. [X] An oath or declaration of the inventor(s) (35 U.S.C. 371(c)(4)).

10. [ ] An English language translation of the annexes to the International Preliminary Examination Report under PCT Article 36 (35 U.S.C.
371(c)(5)).

Items 11 to 20 below concern document(s) or information included:

11. [ ] An Information Disclosure Statement under 37 CFR 1.97 and 1.98.

12. [ ] An assignment document for recording. A separate cover sheet in compliance with 37 CFR 3.28 and 3.31 is included.

13. [X] A FIRST preliminary amendment.

14. [ ] A SECOND or SUBSEQUENT preliminary amendment.

15. [ ] A substitute specification.

16. [ ] A change of power of attorney and/or address letter.

17. [ ] A computer-readable form of the sequence listing in accordance with PCT Rule 13ter.2 and 35 U.S.C. 1.821 - 1.825.

18. [ ] A second copy of the published international application under 35 U.S.C. 154(d)(4).

19. [ ] A second copy of the English language translation of the international application under 35 U.S.C. 154(d)(4).

20. [X] Other items or information: International Search Report; International Preliminary Examination Report

21839 



(10/01) 






U.S. APPLICATION NO. (If known, see 37 C.F.R. 1.5)

Unassigned 10/019059



INTERNATIONAL APPLICATION NO. 

PCT/IE00/00083



ATTORNEY'S DOCKET NUMBER 

006838-079 



21. 



The following fees are submitted: 



CALCULATIONS 



PTO USE ONLY 



Basic National Fee (37 CFR 1.492(a)(1)-(5)):

$1,040.00 (960)

International preliminary examination fee (37 CFR 1.482) not paid to
USPTO but International Search Report prepared by the EPO or JPO $890.00 (970)

International preliminary examination fee (37 CFR 1.482) not paid to USPTO
but international search fee (37 CFR 1.445(a)(2)) paid to USPTO $740.00 (958)

International preliminary examination fee (37 CFR 1.482) paid to USPTO
but all claims did not satisfy provisions of PCT Article 33(1)-(4) $710.00 (956)

International preliminary examination fee (37 CFR 1.482) paid to USPTO
and all claims satisfied provisions of PCT Article 33(1)-(4) $100.00 (962)



ENTER APPROPRIATE BASIC FEE AMOUNT = 



890.00 



[illegible] later than 20 [ ] 30 [ ]



Claims                                        Number Filed    Number Extra    Rate

Total Claims                                  18 - 20 =                       x $18.00 (966)

Independent Claims                            2 - 3 =                         x $84.00 (964)

Multiple dependent claim(s) (if applicable)                                   + $280.00 (968)



TOTAL OF ABOVE CALCULATIONS = 



890.00 



Reduction of 1/2 for filing by small entity, if applicable (see below).



SUBTOTAL 



[illegible] translation later than 20 [ ] 30 [ ]



890.00 



TOTAL NATIONAL FEE = 



890.00 



Fee for recording the enclosed assignment (37 CFR 1.21(h)). The assignment must be accompanied by
an appropriate cover sheet (37 CFR 3.28, 3.31). $40.00 (581) per property



TOTAL FEES ENCLOSED = 



890.00 



Amount to be 
refunded: 





a. 


□ 


b. 


lEI 


c. 


□ 


d. 





charged; 



A check in the amount of $ 890.00 to cover the above fees is enclosed. 

Please charge my Deposit Account No. 02-4800 in the amount of $ to cover the above fees. A duplicate copy of this sheet
is enclosed.

The Commissioner is hereby authorized to charge any additional fees which may be required, or credit any overpayment to Deposit
Account No. 02-4800. A duplicate copy of this sheet is enclosed.
NOTE: Where an appropriate time limit under 37 CFR 1.494 or 1.495 has not been met, a petition to revive (37 CFR 1.137(a) or (b))
must be filed and granted to restore the application to pending status.



SEND ALL CORRESPONDENCE TO: 
Platon N. Mandros

Burns, Doane, Swecker & Mathis, L.L.P.
P.O. Box 1404
Alexandria, Virginia 22313-1404
(703) 836-6620



signature 
Platon N. Mandros 



NAME 

22,124 

REGISTRATION NUMBER 



December 27, 2001



DATE 



(10/01) 






Patent 

Attorney's Docket No. 006838-079 



IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 

In re Patent Application of            )
                                       )
Damian Dalton                          )   Group Art Unit: Unassigned
                                       )
Application No.: Unassigned            )   Examiner: Unassigned
                                       )
Filed: December 27, 2001               )
                                       )
For: LOGIC EVENT SIMULATION            )

PRELIMINARY AMENDMENT

Assistant Commissioner for Patents
Washington, D.C. 20231

Sir:

Prior to examination of the above-captioned patent application, kindly enter the
following amendment.

IN THE CLAIMS:

Kindly replace Claims 3, 4, 5, 6, 7, 8, 10, 11, 12, 14 and 15 as follows:

3. A method as claimed in claim 1 in which the hit list is segmented into a
plurality of separate smaller hit lists, each connected to a separate scan register and in
which each scan register is operated in parallel to transfer the results to the output register.


4. A method as claimed in claim 1 in which the associative register (1b) is
divided into separate smaller associative sub-registers, one type of logic gate being
allocated to each associative sub-register, each of which associative sub-registers has
corresponding sub-registers connected thereto whereby gate evaluations and tests are
carried out in parallel on each associative sub-register.

5. A method as claimed in claim 1 in which each line signal to a target logic gate
is stored as a plurality of bits each representing a delay of one time period, the aggregate
bits representing the delay between signal output to and reception by the target logic gate
and in which the inherent delay of each logic gate is represented in the same manner.

6. A method as claimed in claim 4 in which each associative sub-register is used to
form a hit list connected to a corresponding separate scan register.

7. A method as claimed in claim 1 in which where the number of the one type of 
logic gate exceeds a predetermined number more than one sub-register is used. 

8. A method as claimed in claim 3 in which the scan registers are controlled by 
exception logic using an OR gate whereby the scan is terminated for each register on the 
OR gate changing state thus indicating no further matches. 

10. A method as claimed in claim 1, in which each line signal to a target logic gate
is stored as a plurality of bits each representing a delay of one time period, the aggregate
bits representing the delay between signal output to and reception by the target logic gate.

11. A method as claimed in claim 1 in which there is an initialisation phase in
which:

specified signal values are inputted; 
unspecified signal values are set to unknown; 

test templates are prepared defining the delay model for each logic gate; 
the input circuit is parsed to generate an equivalent circuit consisting of 2-input 
logic gates; and 

the 2-input logic gates are then configured. 



12. A method as claimed in claim 1 in which a multi-valued logic is applied and in 
which n bits are used to represent a signal value at any instance in time with n being any 
arbitrarily chosen logic. 

14. A method as claimed in claim 12 in which the sequence of values on a logic
gate is stored as a bit pattern forming a unique word in the associative memory mechanism
(1a, 1b).

15. A method as claimed in claim 1 in which there is stored a record of all values
that a logic gate has acquired for the units of delay of the longest delay in the circuit.

REMARKS

No new matter has been introduced.

The wording in Claims 3, 4, 5, 6, 7, 8, 10, 11, 12, 14 and 15 has been amended to
avoid multiple dependency.

Early and favorable consideration with respect to this application is respectfully
requested.

These changes have been made in accordance with 37 C.F.R. § 1.121 as amended
on November 7, 2000. Marked-up versions of Claims 3, 4, 5, 7, 8, 9, 10, 12, 13, 16, 19,
20, 21, 22, 24, 25, 28, 29, 32, 34 and 35 indicating the changes are enclosed with this
Preliminary Amendment.

Should any questions arise in connection with this application, the undersigned 
respectfully requests that he be contacted at the number indicated below. 

Respectfully submitted,

Burns, Doane, Swecker & Mathis, L.L.P.



P.O. Box 1404 

Alexandria, Virginia 22313-1404 
(703) 836-6620 



By: 

Platon N. Mandros 
Registration No. 22,124 



Date: December 27, 2001

Attachment to Preliminary Amendment dated December 27, 2001

Marked-up Claims

3. A method as claimed in claim 1 [or 2] in which the hit list is segmented into a
plurality of separate smaller hit lists, each connected to a separate scan register and in 
which each scan register is operated in parallel to transfer the results to the output register. 

4. A method as claimed in [any of claims 1 to 3] claim 1 in which the associative
register (1b) is divided into separate smaller associative sub-registers, one type of logic gate
being allocated to each associative sub-register, each of which associative sub-registers has
corresponding sub-registers connected thereto whereby gate evaluations and tests are
carried out in parallel on each associative sub-register.

5. A method as claimed in [any of claims 1 to 4] claim 1 in which each line signal 
to a target logic gate is stored as a plurality of bits each representing a delay of one time 
period, the aggregate bits representing the delay between signal output to and reception by 
the target logic gate and in which the inherent delay of each logic gate is represented in the 
same manner. 

6. A method as claimed in claim 4 [or 5] in which each associative sub-register is
used to form a hit list connected to a corresponding separate scan register.

7. A method as claimed in [any of claims 1 to 6] claim 1 in which where the
number of the one type of logic gate exceeds a predetermined number more than one
sub-register is used.

8. A method as claimed in [any of claims 3 to 7] claim 3 in which the scan
registers are controlled by exception logic using an OR gate whereby the scan is terminated
for each register on the OR gate changing state thus indicating no further matches.

10. A method as claimed in [any preceding claim] claim 1, in which each line
signal to a target logic gate is stored as a plurality of bits each representing a delay of one
time period, the aggregate bits representing the delay between signal output to and
reception by the target logic gate.

11. A method as claimed in [any preceding claim] claim 1 in which there is an 
initialisation phase in which: 

specified signal values are inputted; 
unspecified signal values are set to unknown; 

test templates are prepared defining the delay model for each logic gate;
the input circuit is parsed to generate an equivalent circuit consisting of 2-input
logic gates; and

the 2-input logic gates are then configured. 

12. A method as claimed in [any preceding claim] claim 1 in which a multi-valued
logic is applied and in which n bits are used to represent a signal value at any instance in
time with n being any arbitrarily chosen logic.

14. A method as claimed in claim 12 [or 13] in which the sequence of values on a
logic gate is stored as a bit pattern forming a unique word in the associative memory
mechanism (1a, 1b).

15. A method as claimed in [any preceding claim] claim 1 in which there is stored
a record of all values that a logic gate has acquired for the units of delay of the longest
delay in the circuit.

"Logic Event Simulation"

Introduction

The present invention is directed towards a parallel processing method of logic
simulation comprising representing signals on a line over a time period as a bit
sequence, evaluating the output of any logic gate including an evaluation of any
inherent delay by a comparison between the bit sequences of its inputs to a
predetermined series of bit patterns and in which those logic gates whose outputs
have changed over the time period are identified during the evaluation of the gate
outputs as real gate changes and only those real gate changes are propagated to fan
out gates and in which the control of the method is carried out in an associative
memory mechanism which stores in word form a history of gate input signals by
compiling a hit list register of logic gate state changes and using a multiple response
resolver forming part of the associative memory mechanism which generates an
address for each hit, and then scans and transfers the results on the hit list to an
output register for subsequent use. The output register may contain the final result
of the simulation or may be a list of outputs to be used for subsequent fan out to
other gates. Further, the invention is directed towards providing a parallel processor
for logic event simulation (APPLES).

Logic simulation plays an important role in the design and validation of VLSI circuits.
As circuits increase in size and complexity, there is an ever more demanding requirement
to accelerate the processing speed of this design tool. Parallel processing has been
perceived in industry as the best method to achieve this goal and numerous parallel
processing systems have been developed. Unfortunately, large speedup figures
have eluded these approaches. Higher speedup figures have been achieved, but
only by compromising the accuracy of the gate delay model employed in these
systems. A large communication overhead due to basic passing of values between
processors, elaborate measures to avoid or recover from deadlock and load
balancing techniques, is the principal barrier.



The ever-expanding size of VLSI (Very Large Scale Integration) circuits has further
emphasised the need for a fast and accurate means of simulating digital circuits. A
compromise between model accuracy and computational feasibility is found in logic
simulation. In this simulation paradigm, signal values are discrete and may acquire
in the simplest case logic values 0 and 1. More complex transient state signal
values are modelled using up to 9-state logic. Logic gates can be modelled as ideal
components with zero switching time or more realistically as electronic components
with finite delay and switching characteristics such as inertial, pure or ambiguous
delays.

Due to the enormity of the computational effort for large circuits, the application of
parallel processing to this problem has been explored. Unfortunately, large
speedup performance for most systems and approaches has been elusive.

Sequential (uni-processor) logic simulation can be divided into two broad
categories: compiled code and event-driven simulation (Breuer et al: Diagnosis and
Reliable Design of Digital Systems. Computer Science Press, New York (1976)).
These techniques can be employed in a parallel environment by partitioning the
circuit amongst processors. In compiled code simulation, all gates are evaluated at
all time steps, even if they are not active. The circuit has to be levellised and only
unit or zero delay models can be employed. Sequential circuits also pose difficulties
for this type of simulation. A compiled code mechanism has been applied to several
generations of specialised parallel hardware accelerators designed by IBM: the
Logic Simulation Machine LSM (Howard et al: Introduction to the IBM Los Gatos
Simulation Machine. Proc IEEE Int. Conf. Computer Design: VLSI in Computers.
(Oct 1983) 580-583), the Yorktown Simulation Engine (Pfister: The Yorktown
Simulation Engine: Introduction. 19th ACM/IEEE Design Automation Conf, (June
1982), 51-54) and the Engineering Verification Engine EVE (Dunn: IBM's
Engineering Design System Support for VLSI Design and Verification. IEEE
Design and Test Computers, (February 1984) 30-40), with performance figures as
high as 2.2 billion gate evaluations/sec reported. Agrawal et al: Logic Simulation
and Parallel Processing. Intl Conf on Computer Aided Design (1990), have analysed
the activity of several circuits and their results have indicated that at any time
instant circuit activity (i.e. gates whose outputs are in transition) is typically in the
range 1% to 0.1%. Therefore, the effective number of gate evaluations of these
engines is likely to be smaller by a factor of a hundred or more. Speedup values
ranging from 6 to 13 for various compiled code benchmark circuits have been
observed on the shared memory MIMD Encore Multimax multiprocessor by Soule
and Blank: Parallel Logic Simulation on General purpose machines. Proc Design
Automation Conf, (June 1988), 166-171. A SIMD (array) version was investigated
by Kravitz (Mueller-Thuns et al: Benchmarking Parallel Processing Platforms: An
Application Perspective. IEEE Trans on Parallel and Distributed systems, 4 No. 8
(Aug 1993)) with similar results.
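
As a rough worked example of the activity argument above, the following sketch scales the quoted peak rate of 2.2 billion gate evaluations/sec by the 1% to 0.1% activity range. The function name and the assumption that useful work scales linearly with the fraction of active gates are illustrative only and are not taken from the cited papers.

```python
def effective_evaluations_per_sec(peak_rate: float, activity: float) -> float:
    """Evaluations per second that land on gates whose outputs are changing.

    Assumes, for illustration only, that useful work scales linearly with
    the fraction of active gates.
    """
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity must be a fraction between 0 and 1")
    return peak_rate * activity


PEAK = 2.2e9  # gate evaluations/sec, the figure quoted in the text

for activity in (0.01, 0.001):  # 1% and 0.1% circuit activity
    useful = effective_evaluations_per_sec(PEAK, activity)
    print(f"activity {activity:.1%}: about {useful:.3g} useful evaluations/sec")
```

Under these assumptions the useful rate falls to roughly 22 million evaluations/sec at 1% activity and 2.2 million at 0.1%, which is the "factor of a hundred or more" referred to above.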

The intrinsic unit delay model of compiled code simulators is overly simplistic for
many applications.

Some delay model limitations of compiled code simulation have been eliminated in
parallel event-driven techniques. These parallel algorithms are largely composed of
two phases: a gate evaluation phase and an event-scheduling phase. The gate
evaluation phase identifies gates that are changing and the scheduling phase puts
the gates affected by these changes (the fan-out gates) into a time-ordered linked
schedule list, determined by the current time and the delays of the active gates.
Soule and Blank: Parallel Logic Simulation on General purpose machines. Proc
Design Automation Conf, (June 1988), 166-171 and Mueller-Thuns et al:
Benchmarking Parallel Processing Platforms: An Application Perspective. IEEE
Trans on Parallel and Distributed systems, 4 No 8 (Aug 1993) have investigated both
Shared and Distributed memory Synchronous event MIMD architectures. Again,
overall performance has been disappointing: the results of several benchmarks
executed on an 8-processor Encore Multimax and an 8-processor iPSC-Hypercube
only gave speedup values ranging from 3 to 5.

Asynchronous event simulation permits limited processor autonomy. Causality
constraints require occasional synchronisation between processors and rolling back
of events. Deadlock between processors must be resolved. Chandy, Misra:
Asynchronous Distributed Simulation via Sequence of Parallel Computations. Comm
ACM 24(ii) (April 1981), 198-206 and Bryant: Simulation of Packet Communications
Architecture Computer Systems. Tech report MIT-LCS-TR-188. MIT Cambridge
(1977) have developed deadlock avoidance algorithms, while Briner: Parallel Mixed
Level Simulation of Digital Circuits Virtual Time. Ph.D. thesis. Dept of El. Eng. Duke
University, (1990) and Jefferson: Virtual time. ACM Trans Programming languages
systems, (July 1985) 404-425 have explored algorithms based on deadlock recovery.
The best speedup performance figures for Shared and Distributed memory
asynchronous MIMD systems were 8.5 for a 14-processor system and 20 for a
32-processor BBN system.

Optimising strategies such as load balancing, circuit partitioning and distributed
queues are necessary to realise the best speedup figures. Unfortunately, these
mechanisms themselves contribute large overhead communication costs for even
modest sized parallel systems. Furthermore, the gate evaluation process, despite
its small granularity, incurs between 10 and 250 machine cycles per gate evaluation.

Statements of Invention 

The invention comprises a method and a processor for an Associated Parallel
Processor for Logic Event Simulation; the processor is referred to in this
specification as APPLES, and is specifically designed for parallel discrete event logic
simulation and for carrying out such a parallel processing method. In summary, the
invention provides gate evaluations in memory and replaces interprocessor
communication with a scan technique. Further, the scan mechanism is so arranged
as to facilitate parallelisation and a wide variety of delay models may be used.

Essentially, there is therefore provided a parallel processing method of logic
simulation comprising representing signals on a line over a time period as a bit
sequence, evaluating the output of any logic gate including an evaluation of any
inherent delay by a comparison between the bit sequences of its inputs to a
predetermined series of bit patterns and in which those logic gates whose outputs
have changed over the time period are identified during the evaluation of the gate
outputs as real gate changes and only those real gate changes are propagated to fan
out gates. The control of the method is carried out in an associative memory
mechanism which stores in word form a history of gate input signals by compiling a
hit list register of logic gate state changes and using a multiple response resolver
forming part of the associative memory mechanism which generates an address for
each hit, and then scans and transfers the results on the hit list to an output register
for subsequent use.

One of the core features of the invention is the segmentation or division of at least
one of the registers or hit lists into smaller registers or hit lists to reduce
computational time. The other feature of considerable importance is the handling of
line signal propagation by modelling signal delays. Finally the method according to
the invention allows simulation to be carried out over arbitrarily chosen time periods.

In one approach, the associative register is divided into separate smaller associative
sub-registers, one type of logic gate being allocated to each associative sub-register,
each of which associative sub-registers has corresponding sub-registers connected
thereto whereby gate evaluations and tests are carried out in parallel on each
associative sub-register.

Alternatively it is possible to achieve a satisfactory simulation, particularly where the
circuit being simulated is not too large, by segmenting the hit list into a plurality of
separate smaller hit lists each connected to a separate scan register; in this case
each scan register is operated in parallel to transfer the results to the output register.
This overcomes the particular computational problem in these parallel processors and
speeds up the whole simulation considerably.

Further, the invention provides a parallel processor for logic event simulation
(APPLES) which essentially has an associative memory mechanism which comprises
a plurality of separate associative sub-registers each for the storage in word form of a
history of gate input signals for a specified type of logic gate. Further, there is a
number of separate additional sub-registers associated with each associative
sub-register whereby gate evaluations and tests can be carried out in parallel on each
associative sub-register.

In the method according to the invention, each associative sub-register is used to
form a hit list connected to a corresponding separate scan register.

Ideally, when there are a number of sub-registers and the number of the one type of
logic gate exceeds a predetermined number, more than one sub-register is used.
The predetermined number will be determined by the computational load.

Ideally, the scan registers are controlled by exception logic using an OR gate
whereby the scan is terminated for each register on the OR gate changing state thus
indicating no further matches.
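
A minimal software sketch of this idea follows, assuming that the OR gate is taken over all bits of a segment's hit list and that bits are cleared as they are scanned. The function names, the fixed segment size and the use of a Python loop in place of parallel hardware are assumptions made for illustration, not details from the specification.

```python
from typing import List, Tuple


def scan_segment(bits: List[int]) -> Tuple[List[int], int]:
    """Scan one hit-list segment; stop as soon as the OR of all bits is 0.

    Returns the hit positions found and the number of positions visited.
    """
    work = list(bits)  # local copy so the caller's list is untouched
    hits: List[int] = []
    visited = 0
    for pos in range(len(work)):
        if not any(work):  # OR gate output has dropped: no further matches
            break
        visited += 1
        if work[pos]:
            hits.append(pos)
            work[pos] = 0  # clear the bit once it has been recorded
    return hits, visited


def scan_segmented(hit_list: List[int], segment_size: int) -> Tuple[List[int], int]:
    """Scan each segment with its own (modelled) scan register.

    The segments would run in parallel in hardware, so the slowest
    segment determines the number of scan steps.
    """
    all_hits: List[int] = []
    steps = 0
    for base in range(0, len(hit_list), segment_size):
        hits, visited = scan_segment(hit_list[base:base + segment_size])
        all_hits.extend(base + h for h in hits)
        steps = max(steps, visited)
    return all_hits, steps
```

For a 16-entry hit list with hits at positions 1, 3 and 8 and two segments of 8 entries, this sketch visits at most 4 positions per segment instead of all 16, because each scan stops once its OR output falls to 0.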

The scan can be carried out in many ways but one of the best ways of carrying it out
is by sequential counting through the hit list and when this is done, generally the
steps are performed of:-

checking if the bit is set indicating a hit;

if a hit, determining the address affected by that hit;

storing the address;

clearing the bit in the hit list;

moving to the next position in the hit list; and

repeating the above steps until the hit list is cleared.

Obviously where fan out occurs subsequently more than one address will be affected.
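
The listed steps can be sketched in software as follows. The `fan_out` dictionary, which maps a hit position to the addresses it affects, is a hypothetical stand-in for the fan out memory, and the function name is likewise illustrative.

```python
from typing import Dict, List


def scan_hit_list(hit_list: List[int], fan_out: Dict[int, List[int]]) -> List[int]:
    """Sequentially scan a hit list and collect the affected addresses.

    hit_list is a list of 0/1 flags and is cleared in place.
    fan_out maps a hit position to the addresses it affects (assumed layout).
    """
    output_register: List[int] = []
    position = 0
    while any(hit_list):  # repeat until the hit list is cleared
        if hit_list[position]:  # check if the bit is set, indicating a hit
            # determine and store the address(es) affected by this hit
            output_register.extend(fan_out.get(position, []))
            hit_list[position] = 0  # clear the bit in the hit list
        position += 1  # move to the next position in the hit list
    return output_register


hits = [0, 1, 0, 1]
addresses = scan_hit_list(hits, {1: [7], 3: [2, 5]})
print(addresses)  # the gate at position 3 fans out to two addresses
```

The loop cannot run past the end of the list: every position already visited has been cleared, so any remaining set bit lies at or beyond the current position.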

In one particular embodiment of the invention, there is provided such a parallel
processing method of logic simulation in which each line signal to a target logic gate
is stored as a plurality of bits each representing a delay of one time period, the
aggregate bits representing the delay between signal output to and reception by the
target logic gate and in which the inherent delay of each logic gate is represented in
the same manner. The time period is arbitrarily chosen and will often be of the order
of 1 nanosecond or less. The fact that the time period can be arbitrarily chosen is of
immense importance since it is possible to simulate a circuit for a plurality of different
time periods. Additionally the effect of the delay inherent in the transfer of line signal
between logic gates is becoming more important as the response time of the
components of circuits reduces.

In this latter embodiment, each delay is stored as a delay word in an associative
memory forming part of the associative memory mechanism in which:-

the length of the delay word is ascertained; and

if the delay word width exceeds the associative register word width:-

the number of integer multiples of the register word width contained within the
delay word is calculated as a gate state;

the gate state is stored in a further state register;

the remainder from the calculation is stored in the associative register with
those delay words whose widths did not exceed the associative register word
width; and

on the count of the associative register commencing:-

the state register is consulted for the delay word entered in the state register
and the remainder is ignored for this count of the associative register;

at the end of the count of the associative register, the state register is
updated; and

the count continues until the remainder represents the count still required.
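
One possible reading of the steps above is sketched below, with a delay expressed as a number of one-bit time periods. The function names, the 32-bit register word width used in the example, and the interpretation that the state register is decremented once per full count of the associative register are all assumptions made for illustration.

```python
from typing import Tuple


def split_delay(delay_periods: int, word_width: int) -> Tuple[int, int]:
    """Split a delay into (gate_state, remainder).

    gate_state is the number of whole register words contained in the delay
    and would be held in the state register; remainder is what is stored in
    the associative register. Delays that fit in one word have gate_state 0.
    """
    if delay_periods <= word_width:
        return 0, delay_periods
    return divmod(delay_periods, word_width)


def periods_until_arrival(gate_state: int, remainder: int, word_width: int) -> int:
    """Replay the count: full register counts first, then the remainder."""
    elapsed = 0
    while gate_state > 0:
        elapsed += word_width  # one full count; the remainder is ignored here
        gate_state -= 1        # state register updated at the end of the count
    return elapsed + remainder  # only the remainder is still required


state, rem = split_delay(70, 32)
print(state, rem, periods_until_arrival(state, rem, 32))
```

With a 70-period delay and a 32-bit word, the split gives a gate state of 2 and a remainder of 6, and replaying the count recovers the original 70 periods.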

For carrying out the invention, an initialisation phase is carried out in which
specified signal values are inputted, unspecified signal values are set to unknown,
test templates are prepared defining the delay model for each logic gate, the input
circuit is parsed to generate an equivalent circuit consisting of 2-input logic gates, and
the 2-input logic gates are then configured.
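
As an illustration of the parsing step only, the sketch below rewrites an n-input gate as a chain of 2-input gates. It is restricted to AND, OR and XOR, where chaining preserves the logic function; the tuple layout, the naming of intermediate nets and the function name are hypothetical, and the sketch does not address how delays would be assigned to the intermediate gates.

```python
from typing import List, Tuple

Gate = Tuple[str, str, str, str]  # (gate type, input a, input b, output net)

ASSOCIATIVE = {"AND", "OR", "XOR"}


def to_two_input(gate_type: str, inputs: List[str], output: str) -> List[Gate]:
    """Decompose an n-input AND/OR/XOR gate into a chain of 2-input gates."""
    if gate_type not in ASSOCIATIVE:
        raise ValueError(f"unsupported gate type: {gate_type}")
    if len(inputs) < 2:
        raise ValueError("need at least two inputs")
    gates: List[Gate] = []
    acc = inputs[0]
    for i, nxt in enumerate(inputs[1:], start=1):
        is_last = i == len(inputs) - 1
        # intermediate nets get generated names; the final gate drives `output`
        out = output if is_last else f"{output}_t{i}"
        gates.append((gate_type, acc, nxt, out))
        acc = out
    return gates


for g in to_two_input("AND", ["a", "b", "c", "d"], "y"):
    print(g)
```

A 4-input AND gate becomes three 2-input AND gates linked through two generated intermediate nets.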
With the present invention, multi-valued logic may be applied and in this situation, n
bits are used to represent a signal value at any instance in time with n being any
arbitrarily chosen logic. A particularly suitable one is an 8-valued logic in which 000
represents logic 0, 111 represents logic 1 and 001 to 110 represent arbitrarily defined
other signal states.
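
The 8-valued encoding can be sketched as follows, packing successive 3-bit signal values into a single integer word. The packing order (earliest value in the most significant bits) and the helper names are assumptions; the specification only fixes 000 as logic 0 and 111 as logic 1.

```python
from typing import List

BITS_PER_VALUE = 3      # n = 3 bits gives an 8-valued logic
VALUE_MASK = 0b111
LOGIC_0 = 0b000         # fixed by the text
LOGIC_1 = 0b111         # fixed by the text
# 0b001 to 0b110 are left for arbitrarily defined other signal states


def pack(values: List[int]) -> int:
    """Pack a sequence of 3-bit signal values into one word."""
    word = 0
    for v in values:
        if not 0 <= v <= VALUE_MASK:
            raise ValueError(f"value {v} does not fit in {BITS_PER_VALUE} bits")
        word = (word << BITS_PER_VALUE) | v
    return word


def unpack(word: int, count: int) -> List[int]:
    """Recover `count` 3-bit signal values from a packed word."""
    return [
        (word >> (BITS_PER_VALUE * (count - 1 - i))) & VALUE_MASK
        for i in range(count)
    ]


history = [LOGIC_0, LOGIC_1, 0b010]
word = pack(history)
print(format(word, "09b"), unpack(word, 3))
```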

One of the features of the invention is that the sequence of values on a logic gate is
stored as a bit pattern forming a unique word in the associative memory mechanism
and by doing this it is possible to store a record of all values that a logic gate has
acquired for the units of delay of the longest delay in the circuit.

Detailed Description of the Invention 

The invention will be more clearly understood from the following description of
embodiments thereof given by way of example only with reference to the
accompanying drawings in which:-

Fig. 1 illustrates the functions of blocks of the APPLES processor;

Fig. 2 illustrates the inertial delay mechanism in the APPLES system;

Fig. 3 is an illustration of a simulated cycle;

Fig. 4 is a test search pattern;

Fig. 5 is an illustration of the logical combination mechanism according to
the invention;

Fig. 6 illustrates components active during a gate evaluation phase;

Fig. 7 shows bit patterns for an ambiguous delay model and hazard detection;

Fig. 8 is an outline of an alternative arrangement of processors according to
the invention;

Fig. 9 illustrates the structure of one processor in more detail; and

Fig. 10 is a view similar to Fig. 1 of the alternative construction of processor.

The essential elemental tasks for parallel logic simulation are: 

1. Gate evaluation.

2. Delay model implementation. 

3. Updating fan-out gates. 

The design framework for a specific parallel logic simulation architecture originated 
by identifying the essential elemental simulation operations which can be performed 
in parallel and by minimising the tasks that support these operations and which are 
totally intrinsic to the parallel system. 

Activities such as event scheduling and load balancing are perceived as
implementation issues which need not necessarily be incorporated into a new
design. An important additional criterion is that the design must execute directly in
hardware as many parallel tasks as possible, as fast as possible but without limiting
the type of delay model.

The present invention, taking account of the above objectives, incorporates several 
special associative memory blocks and hardware in the APPLES architecture. 

The gate evaluation/delay model implementation and update/fan-out process will
be explained with reference to the APPLES architecture shown in Fig. 1.

Referring to Fig. 1, the functional blocks of the APPLES processor are shown. The
blocks pertinent to gate evaluation are associative array 1a 1, input-value-register
bank 2, associative array 1b, test-result-register bank 4, group-result register bank
5 and the group-test hit list 6. The group-test hit list in turn feeds a multiple
response resolver 7, which in turn feeds a fan-out memory 8 to an address register
9 connected to the input-value-register bank 2. The associative array 1 has an
associative mask register 1a and an input register 1a, while the associative array 1b
has a mask register 1b and an input register 1b. Similarly, the test-result register
bank 4 has a result-activator register 14 and the group-result register bank 5 has a
mask register 15 and an input register 16. Finally, an input-value register bank 17
is provided. Apart from the associative arrays, the group-result register bank has
parallel search facilities. Regardless of the number of words, these structures
can be searched in parallel in constant time. Furthermore, the words in the
input-value-register bank 17 and associative array 1b can be shifted right in parallel
while resident in memory.

A gate can be evaluated once its input wire values are known. In conventional
uniprocessor and parallel systems these values are stored in memory and accessed by
the processor(s) when the gate is activated. In APPLES, gate signal values are
stored in associative memory words. The succession of signal values that have
appeared on a particular wire over a period of time is stored in a given associative
memory word in a time-ordered sequence. For instance, a binary value model could
store in a 32-bit word the history of wire values that have appeared over the last 32
time intervals. Gate evaluation proceeds by searching in parallel for appropriate
signal values in associative memory. Portions of the words which are irrelevant (e.g.
only the 4 most recent bits are relevant for a 4-unit gate delay model) are masked out
of the search by the memory's input and mask register combination. For a given gate
type (e.g. AND, OR) and gate delay model there are requirements on the structure of
the input signals to effect an output change. Each pattern search in associative
memory detects those signal values that have a certain attribute of the necessary
structure (e.g. those signals which have gone high within the last 3 time units).
Those wires that have all the attributes indicate active gates. The wire values are
stored in a memory block designated associative array 1b (word-line-register bank).

Only those gate types relevant to the applied search patterns are selected. This is
accomplished by tagging a gate type to each word. These tags are held in
associative array 1a. A specific gate type is activated by a parallel search of the
designated tag in associative array 1a.

This simple evaluation mechanism implies that the wires must be identified by the
type of gate into which they flow, since different gate types have different input wire
sequences that activate them. Gates of a certain type are selected by a parallel
search on gate type identifiers in associative array 1a.
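
By way of illustration only, the masked pattern search described above can be
sketched in software. The following Python fragment is a minimal sketch, not the
hardware implementation: the names Wire, masked_search and WORD_BITS, the 8-bit
word length and the sample data are all assumptions introduced for the example. The
loop visits words one at a time, whereas the associative memory compares every word
in the same step.

```python
from dataclasses import dataclass

WORD_BITS = 8  # hypothetical history length; the text mentions 32-bit words


@dataclass
class Wire:
    gate_type: str  # tag of the kind held in associative array 1a
    history: int    # time-ordered bits as in associative array 1b, MSB = most recent


def masked_search(wires, gate_type, pattern, mask):
    """Return one hit flag per word.

    A word is a hit when its gate-type tag matches and the bits selected by
    `mask` equal the corresponding bits of `pattern`. Bits outside the mask
    are ignored, which models the input/mask register combination.
    """
    return [
        w.gate_type == gate_type and (w.history & mask) == (pattern & mask)
        for w in wires
    ]


wires = [
    Wire("AND", 0b11100000),  # high for the last three time units
    Wire("AND", 0b01100000),  # most recent value is low
    Wire("OR", 0b11100000),   # right pattern, wrong gate type
]

# Attribute: "has been high for the last 3 time units" (only the top 3 bits matter).
mask = 0b11100000
hits = masked_search(wires, "AND", 0b11100000, mask)
print(hits)
```

Only the first word satisfies both the tag search and the masked bit pattern; the
remaining five bits of each word take no part in the comparison.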

Each signal attribute corresponds to a bit pattern search in memory. Since several
attributes are normally required for an activated gate, the results of several pattern
searches must be recorded. These searches can be considered as tests on words.

The result of a test is either successful or not. This can be recorded as a single bit in
a corresponding word in another register, held in a register bank termed the
test-result-register bank. Since each gate is assumed to have two inputs (inverters and
multiple input gates are translated into their 2-input gate circuit equivalents), tests are
combined on pairs of words in this bank. This combination mechanism is specific to
a delay model, is defined by the result-activator register and consists of a simple
AND or OR operation between bits in the word pairs.

The results of combining each word pair, the final stage of the gate evaluation
process, are stored as a single word in another associative array, the group-result
register bank 5. Active gates will have a unique bit pattern in this bank and can be
identified by a parallel search for this bit pattern. Successful candidates of this search
set their bit in the 1-bit column register group-test hit list.

The bits in each column position of every gate pair in the test-result register bank 4
are combined in accordance with the logic operators defined in the result-activator
register. The bits in each column are combined sequentially in time in order to reduce
the number of output lines in the test-result register bank 4. Thus, there is only one
output line required for each gate pair in the test-result register bank, instead of one
wire for each column position.

The results of the combination of gate pairs in the test-result register bank 4 are
written column by column into the group-result register bank 5. Only one column in
parallel is written at a particular clock edge. This implies that only one input wire to
the group-result register bank 5 is required per gate pair in the test-result register
bank. This reduces the number of connections from the test-result register bank to the
group-result register bank.

The scan registers are independent in so far as they can be decremented or
incremented while other scan registers are disabled; however, they are clocked in
unison by one clock signal.

The optimum number of scan registers is given by the inverse of the probability of a 
hit being detected in the hit list. 

It is essential that an OR operation of all bits in the hit list is computed on one
edge of a clock period to determine when all hit bits are clear, and that on the converse
edge of the same clock cycle any scan register that is given access to its fan-out
list is permitted to clear the hit bit that it has detected. The access is controlled by a
wait semaphore system to ensure only one access at a time is made to each
single-ported memory.

An alternative system consists of a multi-ported fan-out memory, consisting of
several memory banks each of which can be simultaneously accessed. Each
memory bank in the system has its own semaphore control mechanism.

An alternative strategy has a hit bit enable the inputs of its fan-out list in the
input-value register bank. The enable connections from the hit list to the appropriate
elements in the input-value register bank are made prior to the commencement of the
simulation and are determined by the connectivity between the gates in the circuit
being simulated. These connections can be made by a dynamically configured
device such as an FPGA (Field Programmable Gate Array) which can physically
route the hit list element to its fan-out inputs. In the process, all active fan-out
elements so connected will be enabled simultaneously and updated with the same
logic value in parallel.

The control core consists of a synchronised, self-regulated sequence of events
identified in one example, the Verilog code, as e0, e1, e2, etc. An event corresponds
to the completion of a major task. The self-regulation means that there is no
software controlling the sequence of events, although there may be software
external to the processor which will solicit information concerning the status of the
processor. Furthermore, it implies that there is no microprogramming involved in
the design. This eliminates the need for a microprogrammed unit and increases the
speed of processing.

In the fan-out update activity controlled, for example, by e20, it is essential that the
event that the multiple response resolver 7 has no more hits to be detected
terminates this activity. Alternatively, this activity could be terminated by the
event that all of the hit list has been scanned. However, detecting that no more hits
exist can terminate the fan-out update procedure early and leads to a faster
execution time for this procedure.

Some logic entities may have delays which exceed the time frame representable in
the word of associative array 1b. Larger delays can be modelled by associating a
state with a gate type. In this case a gate and its state are defined in associative
array 1a. Tests are performed on associative array 1b and when a gate with a
given state passes some input value criterion, in addition to the fan-out components
of the gate possibly being affected, the gate state is amended in associative array
1a. This new state may also cause a new output value to be ascribed to the fan-out
list of the gate. The tests that are applied are determined by the gate type and
state. In this mechanism the fan-out list of a gate includes the normal fan-out
inputs and the address in associative array 1a of the gate itself.

In order to determine whether the state, or the state and the fan-out gates, are to be
updated, the state (a binary value) can serve as an offset into the gate's fan-out
update data files. The state is added to the start location of each of a gate's data files
and this enables the gate's normal fan-out list to be bypassed or not.

The interconnect between logic entities being simulated can be modelled using a
large delay model described below. Furthermore, single wires can be modelled by
one word instead of two in associative array 1a, associative array 1b and the
test-result register bank 4. Branch points are modelled as separate wires, permitting
different branch points to have different delay characteristics.

An efficient implementation uses single word versions of associative array 1a,
associative array 1b and the test-result register bank.

The APPLES gate evaluation mechanism selects gates of a certain type, applies a
sequence of bit pattern searches (tests) to them and ascertains the active gates by
recording the result of each pattern search and determining those that have fulfilled
all the necessary tests. This mechanism executes gate evaluation in constant
time: the parallel search is independent of the number of words. This is an effective
linear speedup for the evaluation activity. It also facilitates different delay models
since a delay model can be defined by a set of search patterns. Further discussion of
this is given below.

Active gates set their bits in the column hit list. A multiple response resolver scans
through this list. The multiple response resolver can be a single counter which
inspects the entire list from top to bottom, stops when it encounters a set bit and then
uses its current value as a vector for the fan-out list of the identified active gate. This
list has the addresses of the fan-out gate inputs in an input-value register bank. The
new logic value of the active gate is written into the appropriate words of this bank.
The resolver then clears the bit before decrementing through the remainder of the list
and repeating this process. All hit bits are ORed together so that when all bits are
clear this can be detected immediately and no further scanning need be done.
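
The single-counter resolver can be sketched as follows. This is an illustrative
Python model under stated assumptions, not the patented circuit: the function name
scan_hit_list, the dictionary used for the fan-out memory and the sample data are
hypothetical, and the software loop counts upwards where the text speaks of the
counter decrementing through the list.

```python
def scan_hit_list(hit_list, fan_out, input_values, new_value):
    """Scan the hit list with one counter and update fan-out inputs.

    hit_list     -- list of 0/1 bits, one per gate
    fan_out      -- maps a gate index to the addresses of its fan-out inputs
    input_values -- model of the input-value register bank (mutated in place)
    new_value    -- logic value written to the fan-out inputs of active gates
    Returns the number of hit-list locations actually inspected.
    """
    inspected = 0
    for index in range(len(hit_list)):
        # OR of all hit bits: when it is 0 no further scanning is needed.
        if not any(hit_list):
            break
        inspected += 1
        if hit_list[index]:
            # The counter value acts as a vector to the gate's fan-out list.
            for address in fan_out[index]:
                input_values[address] = new_value
            hit_list[index] = 0  # clear the bit before moving on
    return inspected


hit_list = [0, 1, 0, 1, 0, 0, 0, 0]
fan_out = {1: [0, 2], 3: [5]}  # hypothetical fan-out lists for gates 1 and 3
input_values = [0] * 6
inspected = scan_hit_list(hit_list, fan_out, input_values, new_value=1)
print(inspected, input_values)
```

In this example the scan stops after four locations because the OR of the hit bits
becomes zero once the second hit has been cleared, so the last four locations are
never visited.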

Several scan registers can be used in the multiple response resolver to scan the
column hit list in parallel. Each register scans an equal-size portion of the list and
operates autonomously except when two or more registers simultaneously detect a
hit, in which case a clash has occurred. Each scan register must then wait until it is
arbitrarily allowed to access and update its fan-out list. The frequency of clashes
depends on the probability of a hit for each scan register; typically this probability is
between 0.01 and 0.001 for digital circuits. The timing mechanism in APPLES enables
only active gates to be identified, and the multiple scan register structure provides a
pipeline of gates to be updated for the current time interval without an explicit
scheduling mechanism. The scheduler has been substituted by this more efficient
parallel scan procedure.

When all gate types have been evaluated for the current time interval, all signals are
updated by shifting in parallel the words of the input-value register bank into the
corresponding words of the word-line register bank. For 8-valued logic (i.e. 3 bits for
each word in the input-value register bank) this phase requires 3 machine cycles. The
input-value register bank can be implemented as a multi-ported memory system
which allows several input values to be updated simultaneously provided that the
values are located in different memory banks. Other logic values can be used.

The APPLES bit shift mechanism has made the role of a scheduler redundant.
Furthermore, it enables the gate evaluation process to be executed in memory,
thereby avoiding the traditional Von Neumann bottleneck. Each word pair in array
1b is effectively a processor. Major issues which cause a large overhead in other
parallel logic simulators are "deadlock" and scheduling.

Deadlock occurs in the Chandy-Misra algorithm due to two rules required for
temporal correctness, an input waiting rule and an output waiting rule. Rule one is
observed by the update mechanism of APPLES. For any time interval Ti to Ti+1, all
words in array 1b reflect the state of wires at time Ti, and at the end of the
evaluation and update process all wires have been updated to time Ti+1. All wires
have been incremented by the smallest timestamp, one discrete time unit. Thus at the
start of every time interval all gates can be evaluated with confidence that the input
values are correct. The output rule is imposed to ensure that signal values arrive
for processing in non-decreasing timestamp order. This is guaranteed in APPLES,
since all signal values maintain their temporal order in each word through the shift
operation. Unlike in the Chandy-Misra algorithm, deadlock is impossible as every
gate can be evaluated at each time interval.

There is no scheduler in the APPLES system. Complex modelling such as inertial
delays has confronted schedulers with costly (timewise) unscheduling problems.
Gates which have been scheduled to become active need to be de-scheduled
when input signals are found to be less than some predefined minimum duration.
This, together with the normal scheduling tasks, contributes to an onerous overhead.

Fig. 2 displays the equivalent mechanism in APPLES. An AND gate has two inputs,
a and b. Assume that unless signals are at least of three units duration no effect
occurs at the output, that the simulation involves only binary values 0 and 1 and that
each bit in array 1b represents one time unit. Signal b is constant at value 1, while
signal a is at logic 1 for two time units, less than the minimum time. This will be
detected by the parallel search generated by the input and mask register combination
and the gate will not become active.
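
The inertial delay example of Fig. 2 can be illustrated with a short Python sketch.
It is offered only as an approximation under stated assumptions: the 8-bit word,
the helper names shift_in and ever_active, and the input sequences are hypothetical.
The point it shows is that a too-short pulse simply never matches the search
pattern, so nothing has to be unscheduled.

```python
WORD_BITS = 8   # hypothetical word length, one bit per time unit
MIN_WIDTH = 3   # minimum pulse width for the gate to respond
# Mask selecting the MIN_WIDTH most recent (leftmost) bits of a word.
MASK = ((1 << MIN_WIDTH) - 1) << (WORD_BITS - MIN_WIDTH)


def shift_in(history, value):
    """Shift the word right by one time unit and place `value` leftmost."""
    return (history >> 1) | (value << (WORD_BITS - 1))


def ever_active(a_values, b_values):
    """Return True if the AND gate becomes active at any time step.

    The gate is active when both inputs have been at logic 1 for at least
    MIN_WIDTH consecutive time units, i.e. the masked bits are all 1.
    """
    a = b = 0
    for value_a, value_b in zip(a_values, b_values):
        a, b = shift_in(a, value_a), shift_in(b, value_b)
        if (a & MASK) == MASK and (b & MASK) == MASK:
            return True
    return False


# Signal b is constant at 1; signal a is high for two units, then for three.
short_pulse = ever_active([0, 1, 1, 0, 0], [1, 1, 1, 1, 1])
long_pulse = ever_active([0, 1, 1, 1, 0], [1, 1, 1, 1, 1])
print(short_pulse, long_pulse)
```

The two-unit pulse on input a never satisfies the three-bit pattern, whereas the
three-unit pulse does.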

The circuit is now ready to be simulated by APPLES and is parsed to generate the
gate type, delay model and topology information required to initialise associative
arrays 1a, 1b and the fan-out vector tables. There is no limit on the number of
fan-out gates.

The APPLES processor assumes that the circuit to be simulated has been
translated into an equivalent circuit composed solely of 2-input logic gates. Thus,
every gate has two wires leading into it (an inverter has two wires from one source).
These wires are organised as adjacent words in associative array 1b, called a
word set. Associative array 1a 1 contains identifiers for every wire indicating the
type of gate and the input to which the wire is connected. The identifiers are in an
associative memory so that, when a particular gate evaluation test is executed, putting
the relevant bit patterns into input-reg1a and mask-reg1a specifies the gate type.
All wires connected to such gates will be identified by a parallel search on
associative array 1a and these will be used to activate the appropriate words in
associative array 1b (word-line register bank). Thus, gate evaluation tests will only
be active on the relevant word sets.

The input-value register bank 17 contains the current input value for each wire.
The three leftmost bits of every word in associative array 1b are shifted from this
bank in parallel when all signal values are being updated by one time unit. During
the update phase of the simulation, fan-out wires of active gates are identified and
the corresponding words in the input-value register bank amended.

Simulation progresses in discrete time units. For any time interval, each gate type is
evaluated by applying tests on associative array 1b and combining and recording
results in the neighbouring register banks. Regardless of the number of gates to be
evaluated, this process occupies between 10 machine cycles for the simplest and 20
machine cycles for the more complex gate delay models, see Fig. 3. Once the
fan-out gate inputs have been amended, all wires are time incremented through a
parallel shift operation of 3 machine cycles duration. In general, for 2^N valued logic, N
shift operations are required to update all signal values.
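
As an illustration of the update phase, the following Python sketch shifts every
word right by one time unit and places the new value in the leftmost position. It
is a sketch under assumptions: the 12-bit word (four time units of 8-valued logic),
the function name update_words and the sample words are hypothetical, and the
three single-bit shift cycles of the hardware are collapsed into one 3-bit shift.

```python
BITS_PER_VALUE = 3  # 8-valued logic, i.e. 2^3 values
WORD_BITS = 12      # hypothetical word holding four time units of history


def update_words(word_line, input_values):
    """Advance every wire by one time unit.

    Each word is shifted right by one value (BITS_PER_VALUE bits) and the
    corresponding entry of the input-value register bank becomes the
    leftmost, most recent value. The oldest value falls off the right end.
    """
    keep = (1 << WORD_BITS) - 1
    return [
        ((word >> BITS_PER_VALUE) | (value << (WORD_BITS - BITS_PER_VALUE))) & keep
        for word, value in zip(word_line, input_values)
    ]


words = [0b111_111_000_000, 0b000_000_000_000]
updated = update_words(words, [0b000, 0b111])
print([format(w, "012b") for w in updated])
```

The first wire, previously at '111' for two time units, now records a current value
of '000'; the second wire records a current value of '111'.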

Fig. 3 illustrates a simulation cycle. In the simulation cycle, the task particularly
affected by the circuit size is that of scanning the hit list. As a circuit grows in size,
the list and the sequential scan time expand proportionately. To address this problem,
which is analogous to the conventional communication overhead problem, the APPLES
architecture incorporates a scan mechanism which can effectively increase the scan
rate as the hit list expands. Thus, there is provided a multiple scan register structure.
One of the features of the present invention is the parallelisation of the application
of test vectors in the gate evaluation phase, as will be described
hereinafter. Similarly, Fig. 4 is a search test pattern for an AND gate.

The series of signal values that appear on a wire over a period of discrete time units
can be represented as a sequence of numbers. For example, in a binary system, if a
wire has a series of logic values 1, 1, 0 applied to it at times t0, t1 and t2 respectively,
where t0 < t1 < t2, the history of signal values on this wire can be denoted as the bit
sequence 011; the further left the bit position, the more recent the value that appeared
on the wire.
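
The encoding in this example can be checked with a few lines of Python. The helper
name history_bits is hypothetical and the sketch covers the binary case only.

```python
def history_bits(values):
    """Encode wire values given in time order (oldest first) as a bit string.

    The most recent value is placed in the leftmost position, as in the text.
    """
    return "".join(str(v) for v in reversed(values))


# Values 1, 1, 0 applied at times t0 < t1 < t2.
encoded = history_bits([1, 1, 0])
print(encoded)
```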


Different delay models involve signal values over various time intervals. In any 
model, signal values stored in a word which are irrelevant are masked out of the 
search pattern. 

The process of updating the signal values of a particular wire is achieved by shifting
right by one time unit all values and positioning the current value into the leftmost
position. Associative array 1b can shift right all its words in unison. The new current
values are shifted into associative array 1b from the input-value register bank.

Referring to Fig. 4, there are illustrated the parallel search patterns for an AND gate
transition to logic '0'.

With wire signal values represented as bit sequences in associative memory words,
the task of gate evaluation can be executed as a sequence of parallel pattern
searches. Figure 4 depicts the situation where 8-valued logic has been employed
and the AND gate has been arbitrarily modelled as having a 1 unit delay.

Any gate which has any input satisfying T1 and no (none) input satisfying T2 will
transition to 0.

Consequently, to determine if the output of this gate is going to transition from logic
1 to logic 0 it is necessary to know the signal values at the current time, tc, and at tc-1.
The current values are contained in the leftmost three bits of the word set. Figure 4
declares the current values on the two inputs as logic 1='111' and logic 0='000' and
the previous values as both logic 1.

To ascertain if this AND gate has an output transition to logic 0, two simple bit
pattern tests will suffice. If ANY current input value is logic 0 (test T1) and NONE of
the previous input values are logic 0 (test T2), then the output will change to logic
0. These are the only conditions for this delay model which will effect this
transition. With associative memory any portion of a word can be active or passive
in a search. Thus, putting '000' and '111' into the leftmost three bits of the search
and mask registers of associative array 1b can execute test T1. Test T2 can be
executed by essentially the same test on the next leftmost three bit positions.

In general each test is applied one at a time. The result of test Ti on word j is stored
in the i-th bit position of word j in the test-result register bank 4. A '1' indicates a
successful test outcome. For each word set, for every test it is necessary to know if
ANY or BOTH or NONE of the inputs passed the particular test. If the i-th bits of
word j and word j+1 in the test-result register bank are ORed together and the result of
this operation is '1', then at least one input in the corresponding word set passed
the test Ti (the ANY condition test). If the result of the operation is '0' then no inputs
passed test Ti (the NONE condition test). Finally, if the bits are ANDed together
and the result is '1' then BOTH have passed test Ti.

The result-activator register 14 combines results which are subsequently
ascertained by the group-result register. The logical interaction is shown in Fig. 5.

The AND or OR operation between the bit positions is dictated by the
result-activator register. A '0' in the i-th bit position of the result-activator register
performs an OR action on the results of test Ti for each word set in the test-result
register bank, and conversely a '1' performs an AND action. Each i-th AND or OR
operation is enacted in parallel through all word set test-result register pairs.

The results of the activity of the result-activator register on each word set test-result
register pair are saved in an associated group-result register. Apart from
retaining the results for a particular word set, the group-result registers are
composite elements in an associative array. This facilitates a parallel search for a
particular result pattern and thus identifies all active gates. These gates are
identified as hits (of the search in the group-result register bank) in the group-test
hit list.

Returning to the AND gate transition to logic '0' example, an AND gate will be
identified as fulfilling the test requisites, any input passing test T1 and none passing
test T2, if its corresponding group-result register has the bit sequence '10' in the first
two bit positions.
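
The AND gate example can be traced end to end with a small Python sketch. It is
illustrative only and rests on assumptions: each word is reduced to six bits (the
current and previous 3-bit values), the helper names field and and_gate_falls are
hypothetical, and one word set is handled at a time whereas the hardware processes
all word sets in parallel.

```python
LOGIC0, LOGIC1 = 0b000, 0b111  # 8-valued logic encodings used in Fig. 4
WORD_BITS = 6                  # hypothetical: current value and previous value


def field(word, position):
    """Extract the 3-bit value at `position` (0 = current, 1 = previous)."""
    shift = WORD_BITS - 3 * (position + 1)
    return (word >> shift) & 0b111


def and_gate_falls(word_a, word_b):
    """Return True if the AND gate output will transition to logic 0.

    Test T1: the current input value is logic 0.
    Test T2: the previous input value is logic 0.
    A result-activator bit of '0' selects OR for both tests, so each group
    result bit is 1 when ANY input of the word set passed that test.
    """
    t1 = [field(w, 0) == LOGIC0 for w in (word_a, word_b)]
    t2 = [field(w, 1) == LOGIC0 for w in (word_a, word_b)]
    group_result = (int(t1[0] or t1[1]), int(t2[0] or t2[1]))
    # Pattern '10': ANY input passed T1 and NONE passed T2.
    return group_result == (1, 0)


# Input a stays at logic 1; input b falls from logic 1 to logic 0.
falls = and_gate_falls(0b111_111, 0b000_111)
# Input b was already at logic 0, so the output was already 0.
already_low = and_gate_falls(0b111_111, 0b000_000)
print(falls, already_low)
```

Only the first word set yields the group result '10' and would set its bit in the
group-test hit list.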

The APPLES components involved in the gate evaluation phase and their
sequencing are shown in Fig. 6.

With the present invention, one of the major features of the method is the storing of
each line signal to a target logic gate as a plurality of bits, each representing a
delay of one time period. The aggregate bits allow the signal output to, and
reception by, the target logic gate to be accurately expressed. Thus, line delays are
represented in the same manner as the inherent delay of each logic gate. It must be
appreciated that, as the speed of circuits increases, the time taken
to transmit a message between two logic gates can be considerable. Thus, the
lines, as well as the logic gates, have to be considered as logic entities.

Some logic entities may have delays which exceed the time frame representable in
the word of associative array 1b. Larger delays can be modelled by associating a
state with a gate type. In this case a gate and its state are defined in associative
array 1a. Tests are performed on associative array 1b and when a gate with a
given state passes some input value criterion, in addition to the fan-out components
of the gate possibly being affected, the gate state is amended in associative array
1a. This new state may also cause a new output value to be ascribed to the fan-out
list of the gate. The tests that are applied are determined by the gate type and
state. In this mechanism the fan-out list of a gate includes the normal fan-out
inputs and the address in associative array 1a of the gate itself.

In order to determine whether the state, or the state and the fan-out gates, are to be
updated, the state (a binary value) can serve as a selector of the gate's fan-out
update data files. The state amends the access point relative to the start location of a
gate's data files and this enables the gate's normal fan-out list to be bypassed or not.

On commencement of filling a new time frame (a word in associative array 1b), a
special symbol is inserted into the left-most (most recent time) position. This symbol
conveys the input value on the gate and serves as a marker. When the marker
reaches the right-most position in the word, this indicates that a complete time
frame has passed. This can be detected by the normal parallel test-pattern search
technique on associative array 1b (see Figure 1).

The interconnect between logic entities being simulated can be modelled using the
large delay model described above. Furthermore, single wires can be modelled by
one word instead of two in associative array 1a, associative array 1b and the
test-result register bank. Branch points are modelled as separate wires, permitting
different branch points to have different delay characteristics.

In effect, each delay is stored as a delay word in an associative
memory forming part of the associative memory mechanism. The length of the
delay word is ascertained and, if the delay word width exceeds the associative
register word width, it cannot simply be stored in the register. Then, the
number of integer multiples of the register word width contained within the delay
word is calculated as a gate state. This gate state is stored in a further state
register, in effect the associative register or associative array 1a. The remainder
from the calculation is stored in the associative register array 1b with those delay
words whose width did not exceed the associative register width as well as with
those words whose width did. Then, on the count of the associative register 1b
commencing, the state register, that is to say the associative register
1a, is consulted and the delay word entered into the register. The remainder is
ignored for this count of the associative register array 1b. At the end of the count of
the associative register 1b, the associative register 1a is updated by decrementing one
unit. If this still does not allow the count to take place, the process is repeated. If,
however, the associative register 1a is cleared, then the count continues and the
remainder now represents the count required.
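
One reading of this large delay mechanism is sketched below in Python. It is an
interpretation offered for illustration, not a definitive statement of the method:
the function names split_delay and count_delay and the 32-unit word width are
assumptions, and a plain loop stands in for the repeated counts of the array 1b word.

```python
def split_delay(delay, word_width):
    """Split a delay into a gate state and a remainder.

    The state is the number of whole register words contained in the delay
    (held in array 1a in the text); the remainder is the part that fits in
    a single word of array 1b.
    """
    state, remainder = divmod(delay, word_width)
    return state, remainder


def count_delay(delay, word_width):
    """Count out a delay one word-length time frame at a time."""
    state, remainder = split_delay(delay, word_width)
    elapsed = 0
    while state > 0:
        elapsed += word_width  # one complete count of the word
        state -= 1             # state decremented by one unit
    # State cleared: the remainder now represents the count required.
    return elapsed + remainder


state, remainder = split_delay(75, 32)
total = count_delay(75, 32)
print(state, remainder, total)
```

A delay of 75 time units with a 32-unit word gives a state of 2 and a remainder of
11, and counting two full frames plus the remainder recovers the original delay.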

Complex delay models such as inertial delays require conventional sequential and
parallel logic simulators to unschedule events when some timing criterion is
violated. This entails an extremely time-consuming search through an event list.
In the present invention, inertial delays only require verification that signals are at
least some minimum time width, which is implementable as a single pattern search.

An ambiguous delay is more complicated, where the statistical behaviour of a gate
conveys an uncertainty in the output. A gate output acquires an unknown value
between some parameters t_min (M time units) and t_max (N time units). Using
4-valued logic, APPLES detects an initial output change to the unknown value at
time t_min, followed by the transition from unknown value to logic state '0' at time
t_max, see Fig. 7. Hazard conditions, where both inputs simultaneously switch to
converse values, can also be detected, which is illustrated in Fig. 7.

For each gate type, the evaluation time $T_{gate\text{-}eval}$ remains constant,
typically ranging from 10 to 20 machine cycles. The time to scan the hit list depends
on its length and the number of registers employed in the scan. N scan registers can
divide a hit list of H locations into N equal partitions of size H/N. Assuming a location
can be scanned in 1 machine cycle, the scan time $T_{scan}$ is H/N cycles. Likewise it
will be assumed that 1 cycle will be sufficient to make 1 fan-out update.

For one scan register partition, the number of updates is $(\mathrm{Prob}_{hit})H/N$.
If all N partitions update without interference from other partitions this also
represents the total update time for the entire system. However, while one fan-out is
being updated, other registers continue to scan and hits in these partitions may have to
wait and queue. The probability of this happening increases with the number of
partitions and is given by $\binom{N}{1}(\mathrm{Prob}_{hit})H/N$.

A clash occurs when two or more registers simultaneously detect a hit and attempt
to access the single-ported fan-out memory. In these circumstances, a semaphore
arbitrarily authorises the accesses of waiting registers to memory. The number of
clashes during a scan is,

\[ \text{No. clashes} = (\text{Prob. of 2 hits per inspection}) \times \frac{H}{N} + \text{higher order probabilities} \qquad (1) \]

The low activity rate of circuits (typically 1%-5% of the total gate count) implies that
higher order probabilities can be ignored. Assume a uniform random distribution of
hits and let $\mathrm{Prob}_{hit}$ be the probability that the register will encounter a
hit on an inspection. Then (1) becomes,

\[ \text{No. clashes} = \binom{N}{2}(\mathrm{Prob}_{hit})^{2} \times \frac{H}{N} \qquad (2) \]

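Thus, $T_N$, the average total time required to scan and update the fan-out lists of a
partition for a particular gate type is,

\[ T_N = T_{gate\text{-}eval} + T_{scan} + T_{update} + T_{clash} \]
\[ \phantom{T_N} = T_{gate\text{-}eval} + \frac{H}{N} + \binom{N}{1}(\mathrm{Prob}_{hit})\frac{H}{N} + \binom{N}{2}(\mathrm{Prob}_{hit})^{2} \times \frac{H}{N} \qquad (3) \]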

Since all partitions are scanned in parallel, $T_N$ also corresponds to the processing
time for an N scan register system. Thus, the speedup $S_p = T_1/T_N$ of such a system
is,

\[ S_p = \frac{T_1}{T_N} = \frac{T_{gate\text{-}eval} + T_{scan} + T_{update}}{T_{gate\text{-}eval} + \frac{H}{N} + \binom{N}{1}(\mathrm{Prob}_{hit})\frac{H}{N} + \binom{N}{2}(\mathrm{Prob}_{hit})^{2} \times \frac{H}{N}} \qquad (4) \]

Equation (4) has been validated empirically. Predicted results are within 20% of
those observed for sample circuits C7552 and C2670 and within 30% for C1908.
Non-uniformity of the hit distribution appears to be the cause of this deviation.



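Differentiating $T_N$ with respect to N and ignoring 2nd order and higher powers of
$\mathrm{Prob}_{hit}$, the optimum number of scan registers $N_{optimum}$ and the
corresponding optimum speedup $S_{optimum}$ are given by,

\[ N_{optimum} \approx \frac{\sqrt{2}}{\mathrm{Prob}_{hit}} \qquad (5) \]
\[ S_{optimum} \approx \frac{1}{2.4 \times \mathrm{Prob}_{hit}} \qquad (6) \]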



Thus, the optimum number of scan registers is determined inversely by the
probability of a hit being encountered in the hit list. In APPLES, the important
processing metric is the rate at which gates can be evaluated and their fan-out lists
updated. As the probability of a hit increases there will be a reciprocal increase in
the rate at which gates are updated. Circuits under simulation which happen to
exhibit higher hit rates will have a higher update rate.
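
The timing model of equations (3) to (6) can be explored numerically. The Python
sketch below is illustrative only: the function names t_n and speedup and the
parameter values (a hit list of 100,000 locations, a hit probability of 0.01 and a
gate evaluation time of 20 cycles) are assumptions chosen for the example and are
not figures taken from the benchmarks.

```python
import math


def t_n(n, h, prob_hit, t_gate_eval):
    """Average scan-and-update time for n scan registers, after equation (3)."""
    t_scan = h / n
    t_update = math.comb(n, 1) * prob_hit * h / n
    t_clash = math.comb(n, 2) * prob_hit ** 2 * h / n
    return t_gate_eval + t_scan + t_update + t_clash


def speedup(n, h, prob_hit, t_gate_eval):
    """Speedup T_1 / T_N of an n scan register system, after equation (4)."""
    return t_n(1, h, prob_hit, t_gate_eval) / t_n(n, h, prob_hit, t_gate_eval)


# Hypothetical parameters, for illustration only.
h, prob_hit, t_gate_eval = 100_000, 0.01, 20

n_opt = math.sqrt(2) / prob_hit      # equation (5)
s_opt = 1 / (2.4 * prob_hit)         # equation (6)

# Brute-force search for the integer register count minimising T_N.
best_n = min(range(1, 500), key=lambda n: t_n(n, h, prob_hit, t_gate_eval))
best_speedup = speedup(best_n, h, prob_hit, t_gate_eval)
print(best_n, round(n_opt, 1), round(best_speedup, 1), round(s_opt, 1))
```

For these assumed values the brute-force minimum lies within one register of the
estimate from equation (5), and the corresponding speedup is close to the value
given by equation (6).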



When the average fan-out time is not one cycle, $\mathrm{Prob}_{hit}$ is multiplied by Pout, where
Pout is the effective average fan-out time.

A higher hit rate can also be accomplished through the introduction of extra
registers. An increase in registers increases the hit rate and the number of clashes.
The increase halts when the hit rate equals the fan-out update rate; this occurs at
$N_{optimum}$. This situation is analogous to a saturated pipeline. Further increases in the
number of registers serve only to increase the number of clashes and the waiting lists
of those registers attempting to update fan-out lists.



Further simulations were carried out, again with a Verilog model of APPLES, on
4 ISCAS-85 benchmarks: C7552 (4392 gates), C2670 (1736 gates),
C1908 (1286 gates) and C880 (622 gates), using a unit delay model. Each was
exercised with 10 random input vectors over a time period ranging from 1,000 to
10,000 machine cycles. Statistics were gathered as the number of scan registers
varied from 1 to 50. The speedup relative to the number of scan registers is shown
in Table 1.



(a) Speedup

               No. Scan Registers
Circuit       15       30       50
C7552       12.5     19.9     24.3
C2670        9.7     13.8     15.9
C1908        8.4     10.8     11.8
C880         7.8      8.3      9.7

(b) Speedup (excl. fixed size overheads)

               No. Scan Registers
Circuit       15       30       50
C7552       13.6     24.3     29.6
C2670       12.5     20.0     25.1
C1908       11.8     17.3     20.9
C880        11.1     12.6     15.9

Table 1. Speedup Performance of Benchmarks



Table (1.a) demonstrates that in general the speedup increases with the number of
scan registers. The fixed size overheads of gate evaluation, shifting inputs etc.
tend to penalise the performance for the smaller circuits with a large number of
registers. A more balanced analysis is obtained by factoring out all fixed time
overheads in the simulation results. This reflects the performance of realistic, large
circuits where the fixed overheads will be negligible compared to the scan time. Table (1.b)
details the results with this correction. As expected this correction has less effect
on the larger benchmark circuits.
            Av. No. Cycles/Gate Processed
                 No. Scan Registers
Circuit        1       15       30       50
C7552      154.6     11.3      6.4      5.2
C2670      101.9      8.0      5.1      3.9
C1908       86.9      6.8      5.1      3.9
C880        49.9      4.9      4.2      3.6

Table 2. Average No. of machine cycles per gate processed

Taking the corrected simulated performance statistics, Table (2) displays the
average number of machine cycles expended to process a gate. The APPLES
system intrinsically detects only active gates; no futile updates or processing are
executed. The data takes into account the scan time between hits and the time to
update the fan-out lists. As more registers are introduced the time between hits
reduces and the gate update rate increases. Clashes happen and active gates are
effectively queued in a fan-out/update pipeline. The speedup saturates when the
fan-out/update rate, governed by the size of the average fan-out list, equals the rate
at which active gates enter the pipeline.
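
As a rough cross-check, the cycles-per-gate figures of Table 2 can be turned into speedups by dividing the single-register figure by the N-register figure. The sketch below does this for the C7552 row only. The helper name is hypothetical, and the 15-register entry is read as 11.3, which is an assumption about a digit that is unclear in the printed table.

```python
# Cycles per gate for C7552, read from Table 2 (15-register entry assumed to be 11.3).
C7552_CYCLES = {1: 154.6, 15: 11.3, 30: 6.4, 50: 5.2}

def speedup_from_cycles(cycles_per_gate):
    """Return {n_registers: speedup}, using the single-register figure as the baseline."""
    baseline = cycles_per_gate[1]
    return {n: baseline / c for n, c in cycles_per_gate.items() if n != 1}

if __name__ == "__main__":
    for n, s in speedup_from_cycles(C7552_CYCLES).items():
        print(f"{n:2d} scan registers: speedup {s:.1f}")
```

The ratios obtained this way lie close to the C7552 entries of Table (1.b), which suggests the two tables describe the same corrected measurements.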

The benchmark performance of the circuits also permits an assessment of the
validity of the theory for the speedup. From the speedup measurements in
Table (1.b) the corresponding value for $f_{av}$ was calculated using Eqt (7). This value,
representing the average fan-out update time in machine cycles, should be constant
regardless of the number of scan registers. Furthermore, for the evaluated
benchmarks the fan-out ranged from 0 to 3 gates and the probability of a hit, $\mathrm{Prob}_{hit}$,
was found to be 0.01 ± 5%. Within one and a half clock cycles it is possible to
update 2 fan-out gates, therefore depending on the circuit $f_{av}$ should be in the range
0.5 to 1.5. The calculated values for $f_{av}$ are shown in Table 3.
               No. Scan Registers
Circuit       15       30       50       Av
C7552       0.41     0.35     0.88     0.55
C2670       0.52     0.79     1.26     0.86
C1908       0.77     1.21     1.32     1.10
C880        0.16     1.98     1.54     1.22

Table 3. The Average Fan-out Update Time (in machine cycles) for the
Benchmarks



The values for $f_{av}$ are in accord with the range expected for the fan-out of these
circuits. The fluctuations in value across a row for $f_{av}$, where it should be constant,
are possibly due to the relatively small number of samples and size of circuits,
where a small perturbation in the distribution of hits in the hit-list can significantly
affect the speedup figures. In the case of C880, a 10% drop in speedup can
effectively lead to a ten-fold increase in $f_{av}$.




For comparison purposes Table 4 uses data from Banerjee: Parallel Algorithms for
VLSI Computer-Aided Design, Prentice-Hall, 1994, which illustrates the speedup
performance on various parallel architectures for circuits of similar size to those
used in this paper. This indicates that APPLES consistently offers higher speedup.



Speedup Performance for Various Parallel Systems

                             Synchronous                Asynchronous
                           Shared     Distributed    (Shared / Distributed
Circuit                    Memory     Memory          Memory)
Multiplier (4990 gates)    5.0/8      /              5.0/8, 5.8/14
H-FRISC (5060 gates)       3.7/8      /              7.0/8, 8.2/14
S15850 (9772 gates)        /          3.2/8          /
S13207 (7951 gates)        /          3.2/8          /
Adder (400 gates)          /          /              4.5/16, 6.5/32
QRS (1000 gates)           /          /              5.0/16, 7.0/32

Notation a/b, where a = Speedup value, b = No. Processors.
Double entries denote two different systems of the same architecture.

TABLE 4 - A Speedup comparison of other parallel architectures

The following from pages 28 to 54 is one example of an implementation of the 
present invention in software written in Verilog. 
Verilog Description of APPLES



Associative Array1a

Description: Each word of this array holds a bit sequence identifying the gate type input
connection of a wire, in the corresponding position in Associative Array1b. The input/mask
register combination defines a gate type that will be activated for searching in Associative
Array1a. Words that successfully match are indicated in a 1-bit column register. The array also
has write capabilities.

module Ary_1a (Input_reg1a, Mask_reg1a, Adr_reg1a, Clock,
               Search_enbl1a, Write_enbl1a, Activ_lst1a);

// Input_reg1a, Mask_reg1a, Adr_reg1a are the Input, Mask and Address registers
// of Associative Array1a.
// When Search_enbl1a is set, the negative edge of Clock initiates a parallel
// search.
// Activ_lst1a is a column register that indicates those words in Associative
// Array1a which compared successfully with the search pattern.

parameter Ary_1a_wdth=7;
parameter Ary1a_size=16383;
integer Ary_index;

input Clock, Search_enbl1a, Write_enbl1a;
input [Ary_1a_wdth:0] Input_reg1a, Mask_reg1a, Adr_reg1a;

output [Ary1a_size:0] Activ_lst1a;
reg [Ary1a_size:0] Activ_lst1a;

reg [Ary_1a_wdth:0] Ary1a_ass_mem[0:Ary1a_size], Temp_reg;

initial
begin
  $readmemb("Ary1a.dat", Ary1a_ass_mem);
  // Ary1a.dat is the data file defining the gate and model types in the circuit.
  for (Ary_index=0; Ary_index<=Ary1a_size; Ary_index=Ary_index+1)
  begin
    Activ_lst1a[Ary_index]=0;
  end
end

always @(negedge Clock)
begin
  if (Search_enbl1a)
  begin
    for (Ary_index=0; Ary_index<=Ary1a_size; Ary_index=Ary_index+1)
    begin
      Temp_reg=Ary1a_ass_mem[Ary_index];
      // A word matches when it agrees with Input_reg1a on every unmasked bit.
      if ((~Mask_reg1a | (Input_reg1a & Temp_reg) |
          (~Input_reg1a & ~Temp_reg)) == 8'hff)
        Activ_lst1a[Ary_index]=1;
      else
        Activ_lst1a[Ary_index]=0;
    end
  end

  if (Write_enbl1a) Ary1a_ass_mem[Adr_reg1a] = Input_reg1a;
end

endmodule
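
The masked comparison used in the parallel search above can be illustrated outside Verilog. The following sketch is for explanation only, with hypothetical function names and an 8-bit word width assumed; it mirrors the test that a word agrees with the input pattern on every bit position where the mask bit is 1.

```python
def masked_match(word, pattern, mask, width=8):
    """True when word equals pattern on every bit position where mask is 1."""
    all_ones = (1 << width) - 1
    agree = (pattern & word) | (~pattern & ~word)   # bitwise XNOR of pattern and word
    return ((~mask | agree) & all_ones) == all_ones

def parallel_search(memory, pattern, mask, width=8):
    """Return a column of 0/1 flags, one per word, like the Activ_lst1a register."""
    return [1 if masked_match(w, pattern, mask, width) else 0 for w in memory]

if __name__ == "__main__":
    memory = [0b10110010, 0b10111111, 0b00110000]
    # Compare only the upper four bits against 1011.
    print(parallel_search(memory, 0b10110000, 0b11110000))
```

Bits whose mask bit is 0 are treated as "don't care", which is what lets one search pattern select a whole gate type.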

Associative Array1b

Description: Every word in this array represents the temporal spread of signal values on a
specific wire, the most recent values being leftmost in each word. All words can be
simultaneously shifted right, effecting a one unit time increment on all wires. The signal values
are updated from a 1-bit column register. The array has parallel search and read and write
capabilities.



module Ary_1b (Search_reg1b, Mask_reg1b, Adr_reg1b, Datain_reg1b,
               Dataout_reg1b, Hit_buffr_reg1b, Shft_enbl, Search_enbl1b,
               Write_enbl, Read_enbl, Clock, Input_bit,
               Word_line_enbl);

// Search_reg1b, Mask_reg1b, Adr_reg1b, Datain_reg1b, Dataout_reg1b are the
// Search, Mask, Address, Data-in and Data-out registers of Associative Array1b.
// When Search_enbl1b is set, the negative edge of Clock initiates a parallel
// search. Likewise, a read or write operation is executed on the negative edge of
// the clock if Write_enbl or Read_enbl is asserted.
// The search is only active on those words that are primed for searching by
// the Word_line_enbl column register. The bits in this register are set/cleared by
// Activ_lst1a of Associative Array1a. This effectively selects gates of a certain
// gate type and delay model. Words that match are identified by a bit being set in the
// corresponding position in Hit_buffr_reg1b.
// Words are shifted right in parallel with the leftmost bit being taken from
// Input_bit.

parameter Ary1b_mem_size=16383;
parameter Wlr_wrdsize=31;
parameter Shft_dly=2;
parameter Adr_reg_bits=13;

input [Wlr_wrdsize:0] Search_reg1b, Mask_reg1b, Datain_reg1b;
input [Ary1b_mem_size:0] Input_bit, Word_line_enbl;
input Clock;
input Shft_enbl, Search_enbl1b, Write_enbl, Read_enbl;

reg [Wlr_wrdsize:0] Temp_reg1;
reg [Wlr_wrdsize:0] Wlr_Ass_mem[0:Ary1b_mem_size];

input [Adr_reg_bits:0] Adr_reg1b;

output [Ary1b_mem_size:0] Hit_buffr_reg1b;
reg [Ary1b_mem_size:0] Hit_buffr_reg1b;
output [Wlr_wrdsize:0] Dataout_reg1b;
reg [Wlr_wrdsize:0] Dataout_reg1b;
integer Mem_indx;

initial $readmemb("Array1b.dat", Wlr_Ass_mem);
// Array1b.dat is the file which initialises all the words in Array1b to the
// Unknown value.

always @(negedge Clock)
begin
  if (Shft_enbl)
  begin
    for (Mem_indx=0; Mem_indx<=Ary1b_mem_size; Mem_indx=Mem_indx+1)
    begin
      Temp_reg1 = Wlr_Ass_mem[Mem_indx];
      Temp_reg1 = Temp_reg1 >> 1;
      Temp_reg1[Wlr_wrdsize] = Input_bit[Mem_indx];
      Wlr_Ass_mem[Mem_indx] = Temp_reg1;
    end
  end
  else
  if (Search_enbl1b)
  begin
    for (Mem_indx=0; Mem_indx<=Ary1b_mem_size; Mem_indx=Mem_indx+1)
    begin
      if (Word_line_enbl[Mem_indx])
      begin
        Temp_reg1 = Wlr_Ass_mem[Mem_indx];
        if ((~Mask_reg1b | (Search_reg1b & Temp_reg1) |
            (~Search_reg1b & ~Temp_reg1)) == 32'hffffffff)
        begin
          Hit_buffr_reg1b[Mem_indx] = 1;
        end
        else
        begin
          Hit_buffr_reg1b[Mem_indx] = 0;
        end
      end
      else
        Hit_buffr_reg1b[Mem_indx] = 0;
    end
  end
  else
  if (Write_enbl)
    Wlr_Ass_mem[Adr_reg1b] = Datain_reg1b;
  else
  if (Read_enbl)
    Dataout_reg1b = Wlr_Ass_mem[Adr_reg1b];
end
endmodule
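
The time increment performed by the shift branch above can be pictured with a small model. This is an illustrative sketch rather than part of the listing; the function name and the 4-bit width used in the example are assumptions.

```python
def shift_in(words, input_bits, width=32):
    """Shift every word right by one place and load input_bits[i] into the leftmost bit."""
    top = width - 1
    all_ones = (1 << width) - 1
    return [((w >> 1) | (b << top)) & all_ones for w, b in zip(words, input_bits)]

if __name__ == "__main__":
    # Two wires with 4-bit histories; the newest value enters on the left.
    history = [0b1000, 0b0011]
    print([format(w, "04b") for w in shift_in(history, [0, 1], width=4)])
```

Each call corresponds to one unit of simulated time: the oldest value falls off the right-hand end and the current value of the wire becomes the leftmost bit.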



WO 01/01298                                                    PCT/IE00/00083



Test-result register Bank 

Description: When an i-th search is executed on Associative Array1b, if word j in Array1b matches
the search pattern, then bit i in word j of the Test-result register bank will be set, otherwise it is
cleared. The Result-activator register specifies the logical combination between pairs of words (a
gate's set of inputs). The result of this combination of word pairs is a column register (half the
length of the number of word pairs).

module Tst_rslt_reg_bank(Inp_buffr_reg, Trr_wrt_enbl, Comb_enbl, Clock,
                         Out_buffr_reg, Rslt_act_reg, Write_pos, Rset);

// Inp_buffr_reg is a column of bits describing the outcome of a search on each
// word in Array1b. This bit column is written into a column of the Test-result
// register bank on the negative edge of Clock when Trr_wrt_enbl is asserted. The
// position of this column is defined by Write_pos.
// Word pairs are combined according to the bit sequence in Rslt_act_reg. A 0 in
// bit i of Rslt_act_reg ORs the i-th bits in each word pair (a 1 ANDs them) and
// produces the result for each pair in Out_buffr_reg. This combination is executed
// on the negative edge of Clock when Comb_enbl is asserted. Rset resets all the
// bits in the Test-result register bank.

parameter Trr_word_size=7;
parameter Trr_mem_size=16383;
parameter Trr_out_size=8191;
parameter Trr_wdth_spec=2;

reg [Trr_word_size:0] Trr_array[0:Trr_mem_size];
reg [Trr_word_size:0] Temp_reg1, Temp_reg2;
reg Rslt_action;

input [Trr_mem_size:0] Inp_buffr_reg;
input [Trr_word_size:0] Rslt_act_reg;
input [Trr_wdth_spec:0] Write_pos;
input Clock;
input Trr_wrt_enbl;
input Comb_enbl;
input Rset;

output [Trr_out_size:0] Out_buffr_reg;
reg [Trr_out_size:0] Out_buffr_reg;

integer Bank_index, i;

always @(negedge Clock)
begin
  if (Trr_wrt_enbl)
  begin
    for (Bank_index=0; Bank_index<=Trr_mem_size; Bank_index=Bank_index+1)
    begin
      Temp_reg1=Trr_array[Bank_index];
      Temp_reg1[Write_pos]=Inp_buffr_reg[Bank_index];
      Trr_array[Bank_index]=Temp_reg1;
    end
  end
  else
  if (Comb_enbl)
  begin
    Rslt_action=Rslt_act_reg[Write_pos];
    for (i=0; i<Trr_word_size; i=i+1)
    begin
      for (Bank_index=0; Bank_index<Trr_mem_size; Bank_index=Bank_index+2)
      begin
        Temp_reg1=Trr_array[Bank_index];
        Temp_reg2=Trr_array[Bank_index+1];
        if (Rslt_action==0)
          Out_buffr_reg[Bank_index/2] = (Temp_reg1[Write_pos] |
                                         Temp_reg2[Write_pos]);
        else
          Out_buffr_reg[Bank_index/2] = Temp_reg1[Write_pos] &
                                        Temp_reg2[Write_pos];
      end
    end
  end
  else
  if (Rset)
  begin
    for (Bank_index=0; Bank_index<=Trr_mem_size; Bank_index=Bank_index+1)
      Trr_array[Bank_index]=8'h00;
  end
end
endmodule
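
The pairwise combination carried out when Comb_enbl is asserted reduces one column of test results to half as many bits, one per word pair. A minimal sketch of that reduction follows; the function name and the flat list representation of the column are assumptions made for illustration.

```python
def combine_pairs(column_bits, use_and):
    """Combine adjacent entries (0,1), (2,3), ... of a result column with AND or OR."""
    if use_and:
        op = lambda a, b: a & b
    else:
        op = lambda a, b: a | b
    return [op(column_bits[i], column_bits[i + 1])
            for i in range(0, len(column_bits) - 1, 2)]

if __name__ == "__main__":
    column = [1, 0, 1, 1, 0, 0]   # search outcome for six words, i.e. three word pairs
    print(combine_pairs(column, use_and=False))
    print(combine_pairs(column, use_and=True))
```

Selecting OR or AND per test column is what the Result-activator bit does in the listing: a 0 selects OR and a 1 selects AND.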
Group-result register Bank

Description: The result of the combination of word pairs in the Test-result register is written as
a column of bits into the Group-result register bank. When all combination results have been
generated a parallel search is executed on the Group-result register to ascertain all word pairs in
Array1b that passed all the test pattern searches.

module Grp_rslt_reg_bank(Grr_inp_reg, Grr_mask_reg, Grr_srch_reg,
                         Clock, Srch_enbl, Wrt_enbl, Write_pos,
                         Grr_hit_list);

// Grr_inp_reg is shifted as a bit column into a column of the Group-result
// register bank defined by Write_pos. This column write operation is activated on
// the negative edge of Clock when Wrt_enbl is asserted.
// Grr_mask_reg and Grr_srch_reg compose a search pattern enacted on the negative
// edge of Clock when Srch_enbl is set. Pattern matches are indicated in
// Grr_hit_list. The Grr_hit_list is also known as the Group-test Hit list.

parameter Grr_mem_size=8191;
parameter Grr_word_size=7;
parameter Grr_wdth_spec=2;

input [Grr_mem_size:0] Grr_inp_reg;
input [Grr_word_size:0] Grr_mask_reg, Grr_srch_reg;
input [Grr_wdth_spec:0] Write_pos;
input Clock, Srch_enbl, Wrt_enbl;

output [Grr_mem_size:0] Grr_hit_list;
reg [Grr_mem_size:0] Grr_hit_list;

reg [Grr_word_size:0] Grr_array[0:Grr_mem_size];
reg [Grr_word_size:0] Temp_reg;

integer Bank_index;

always @(negedge Clock)
  if (Wrt_enbl)
  begin
    for (Bank_index=0; Bank_index<=Grr_mem_size;
         Bank_index=Bank_index+1)
    begin
      Temp_reg = Grr_array[Bank_index];
      Temp_reg[Write_pos] = Grr_inp_reg[Bank_index];
      Grr_array[Bank_index] = Temp_reg;
    end
  end
  else if (Srch_enbl)
    for (Bank_index=0; Bank_index<=Grr_mem_size;
         Bank_index=Bank_index+1)
    begin
      Temp_reg = Grr_array[Bank_index];
      if ((~Grr_mask_reg | (Grr_srch_reg & Temp_reg) |
          (~Grr_srch_reg & ~Temp_reg)) == 8'hff)
        Grr_hit_list[Bank_index] = 1;
      else
        Grr_hit_list[Bank_index] = 0;
    end

endmodule

Multiple-response resolver (Version 1.0 Single Scan mode) 

Description: The Multiple-response resolver scans the Group-test Hit list (a 1-bit column
register). The resolver commences a scan by initialising its counter with the top address of the
Hit list. This counter serves as an address register which facilitates reading of every Hit list bit.
If the inspected bit is set, the fan-out list of the associated gate is accessed and updated
appropriately. The bit is then reset. After reset or if the bit was already zero, the counter is
decremented to point to the next address in the Hit list. The inspection process is repeated. The
scanning terminates either when all bits have been inspected or all bits are zero.

module Multiple_res_res (Grr_hit_list, Clock,
                         Reset_ctr, End_scan_flag, Decrmt_enbl,
                         Fan_out_src_reg, Fan_out_size_reg, Rset_hit_fnd_flg,
                         Hit_fnd_flag);

// The Multiple_response_resolver inspects a new bit of Grr_hit_list on the
// negative edge of Clock while Decrmt_enbl is asserted. Reset_ctr loads the
// resolver's counter with the top location of the Hit list. If the current inspected
// bit is set, Hit_fnd_flag is asserted and the vector and the size (no. of gates) for
// the fan-out list are loaded into Fan_out_src_reg and Fan_out_size_reg, respectively.
// Scanning halts and only recommences on the positive edge of Rset_hit_fnd_flg which
// is externally controlled. Scanning terminates when all bits have been inspected or
// reset to zero. This condition is indicated by End_scan_flag.

parameter Grr_mem_size=8191;
parameter Vectr_tbl_adr_reg_bits=13;
parameter Fanout_hdr_tbl_wdth=13;
parameter Max_fan_out=7;
parameter Inp_bnk_size=16383;

input Reset_ctr, Rset_hit_fnd_flg, Clock;
input [Grr_mem_size:0] Grr_hit_list;
input Decrmt_enbl;

output End_scan_flag;
reg End_scan_flag;

output Hit_fnd_flag;
reg Hit_fnd_flag;

output [Vectr_tbl_adr_reg_bits:0] Fan_out_src_reg;
reg [Vectr_tbl_adr_reg_bits:0] Fan_out_src_reg;

output [Max_fan_out:0] Fan_out_size_reg;
reg [Max_fan_out:0] Fan_out_size_reg;

reg [Fanout_hdr_tbl_wdth:0] Fan_out_hdr_tbl[0:Inp_bnk_size];
reg [Vectr_tbl_adr_reg_bits:0] Hit_lst_ctr;
reg [Max_fan_out:0] Fan_out_size_tbl[0:Inp_bnk_size];
reg [Grr_mem_size:0] Hit_lst_buffr;

reg Hit_fnd_ORed_flg, Tst_or_bit;
integer Num_hits, Hit_dist, Sum_hit_dist, Prev_hit_lst_ctr, Avg_dist;

initial $readmemh("Fanout.dat", Fan_out_hdr_tbl);
// The file Fanout.dat contains the vectors for the start of the fan-out lists for
// every gate in the circuit being simulated.

initial $readmemh("Fansize.dat", Fan_out_size_tbl);
// The file Fansize.dat specifies the size of the fan-out list for each gate being
// simulated.

initial forever
begin
  @(Reset_ctr)
  if (Reset_ctr)
  begin
    Num_hits=0;
    Prev_hit_lst_ctr=Grr_mem_size;
    Sum_hit_dist=0;
    Hit_lst_buffr=Grr_hit_list;
    Tst_or_bit=|Grr_hit_list;
    $display("OR Check=%b", Tst_or_bit);
    Hit_lst_ctr=Grr_mem_size;
    End_scan_flag=0;
    Hit_fnd_flag=0;
    Hit_fnd_ORed_flg=1;
    $display("Initialisation seq executed");
  end
end

always @(negedge Clock)
begin
  if ((Decrmt_enbl) && (!End_scan_flag))
  begin
    Hit_fnd_ORed_flg=|Hit_lst_buffr;
    if ((Hit_lst_ctr>0) && (Hit_fnd_ORed_flg))
    begin
      if (Hit_lst_buffr[Hit_lst_ctr]==1)
      begin
        Num_hits=Num_hits+1;
        Hit_dist=Prev_hit_lst_ctr-Hit_lst_ctr;
        Sum_hit_dist=Hit_dist+Sum_hit_dist;
        $display("Hit distance=%d", Hit_dist, "Time=%d", $time);
        Prev_hit_lst_ctr=Hit_lst_ctr;
        Fan_out_size_reg=Fan_out_size_tbl[Hit_lst_ctr];
        Fan_out_src_reg=Fan_out_hdr_tbl[Hit_lst_ctr];
        Hit_fnd_flag=1;
        Hit_lst_buffr[Hit_lst_ctr]=0;
      end
    end

    if ((Hit_lst_ctr>0) && (!Hit_fnd_ORed_flg))
    begin
      End_scan_flag=1;
      $display("No of hits in fan-out list=%d", Num_hits);
      Avg_dist=Sum_hit_dist/Num_hits;
      $display("Average hit distance=%d", Avg_dist);
    end

    if (Hit_lst_ctr==0)
    begin
      if (Hit_lst_buffr[Hit_lst_ctr]==1)
      begin
        Num_hits=Num_hits+1;
        Hit_dist=Prev_hit_lst_ctr-Hit_lst_ctr;
        $display("Hit distance=%d", Hit_dist);
        Prev_hit_lst_ctr=Hit_lst_ctr;
        Sum_hit_dist=Hit_dist+Sum_hit_dist;
        Fan_out_size_reg=Fan_out_size_tbl[Hit_lst_ctr];
        Fan_out_src_reg=Fan_out_hdr_tbl[Hit_lst_ctr];
        Hit_fnd_flag=1;
      end

      End_scan_flag=1;
      $display("No of hits in fan-out list=%d", Num_hits);
      Avg_dist=Sum_hit_dist/Num_hits;
      $display("Average hit distance=%d", Avg_dist);
    end

    Hit_lst_ctr=Hit_lst_ctr-1;
  end
end

always @(posedge Rset_hit_fnd_flg)
begin
  Hit_fnd_flag=0;
end

endmodule
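
The single scan procedure can be summarised in a few lines. The sketch below is a behavioural illustration only: the function name is hypothetical, one inspection is counted as one cycle, and the time taken to update a fan-out list is deliberately left out.

```python
def single_scan(hit_list):
    """Scan from the top address down; return (hit addresses in visit order, cycles used)."""
    bits = list(hit_list)          # work on a copy, like Hit_lst_buffr
    hits = []
    cycles = 0
    addr = len(bits) - 1           # counter starts at the top of the Hit list
    while addr >= 0 and any(bits): # stop when all inspected or all bits are zero
        cycles += 1
        if bits[addr]:
            hits.append(addr)      # the fan-out list of this gate would be updated here
            bits[addr] = 0         # the bit is then reset
        addr -= 1
    return hits, cycles

if __name__ == "__main__":
    print(single_scan([0, 1, 0, 0, 1, 0]))
```

The early exit on an all-zero buffer corresponds to the OR of the hit buffer that the listing recomputes on every clock edge.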
Multiple-response resolver (Version 2.0 Multiple Scan Mode)

Description: The Multiple-response resolver scans the Group-test Hit list (a 1-bit column
register). The resolver in Multiple Scan Mode consists of several counter (scan) registers. Each is
assigned an equal size portion of the Group-test Hit list. When the resolver is initialised all scan
registers point to the top of their respective Hit list segment. The registers are synchronised by a
single clock. The external functionality of the Multiple Scan Mode resolver is identical to that of
the Single Scan Mode version. Internally, the Multiple Scan version uses a Wait semaphore to
queue multiple accesses to the fan-out lists. Registers which clash are queued arbitrarily and
only recommence scanning after gaining permission to update their fan-out lists. Scanning
terminates when all bits have been inspected or all bits are zero.

module Multiple_res_res (Grr_hit_list, Clk,
                         Reset_ctr, End_scan_flag, Decrmt_enbl,
                         Fan_out_src_reg, Fan_out_size_reg, Rset_hit_fnd_flg,
                         Hit_fnd_flag);

// The Multiple_response_resolver inspects in parallel several bits of
// Grr_hit_list on the negative edge of Clock while Decrmt_enbl is asserted.
// Reset_ctr loads the resolver's scan registers with the top location of each
// respective segment of the Hit list. If any of the current inspected bits are set,
// Hit_fnd_flag is asserted. The vector and the size (no. of gates) for the fan-out
// list of the segment which has been granted permission are loaded into
// Fan_out_src_reg and Fan_out_size_reg, respectively. Scanning halts for all
// registers awaiting permission. Permission is arbitrarily granted to a segment on
// the positive edge of Rset_hit_fnd_flg which is externally controlled. For
// registers that have not found a hit, a new bit is inspected on the negative edge
// of Clock. Scanning terminates when all bits have been inspected or reset to zero.
// This condition is indicated by End_scan_flag.

parameter Grr_mem_size=8191;
parameter Vectr_tbl_adr_reg_bits=13;
parameter Fanout_hdr_tbl_wdth=13;
parameter Max_fan_out=7;
parameter Inp_bnk_size=16383;

input Reset_ctr, Rset_hit_fnd_flg, Clk;
input [Grr_mem_size:0] Grr_hit_list;
input Decrmt_enbl;

output End_scan_flag;
reg End_scan_flag;

output Hit_fnd_flag;
reg Hit_fnd_flag;

output [Vectr_tbl_adr_reg_bits:0] Fan_out_src_reg;
reg [Vectr_tbl_adr_reg_bits:0] Fan_out_src_reg;

output [Max_fan_out:0] Fan_out_size_reg;
reg [Max_fan_out:0] Fan_out_size_reg;

reg [Fanout_hdr_tbl_wdth:0] Fan_out_hdr_tbl[0:Inp_bnk_size];
reg [Max_fan_out:0] Fan_out_size_tbl[0:Inp_bnk_size];
reg [Grr_mem_size:0] Hit_lst_buffr;

reg Hit_fnd_ORed_flg, Tst_or_bit, Mpl_scan_enbl;

integer Num_hits, Num_hits_ratio, Start_time, Finish_time;

reg decrmt_enbl1, decrmt_enbl2, decrmt_enbl3, decrmt_enbl4, mem_access;
reg decrmt_enbl5, decrmt_enbl6, decrmt_enbl7, decrmt_enbl8;

reg decrmt_enbl25, decrmt_enbl26, decrmt_enbl27, decrmt_enbl28;
reg decrmt_enbl29, decrmt_enbl30;

// These registers enable a segment to be scanned when asserted. This program
// assumes that the list is divided into 30 equal size segments.

integer c1, c2, c3, c4, c5, c6, c7, c8;

integer c25, c26, c27, c28, c29, c30, Total;

reg [Vectr_tbl_adr_reg_bits:0] pos1, pos2, pos3, pos4, pos5, pos6, pos7, pos8;
reg [Vectr_tbl_adr_reg_bits:0] pos25, pos26, pos27, pos28, pos29, pos30;
// These are the scan registers for each segment.

parameter upr_lt1=149;
parameter lwr_lt1=0;
parameter upr_lt2=299;
parameter lwr_lt2=150;
parameter upr_lt3=449;
parameter lwr_lt3=300;
parameter upr_lt4=599;
parameter lwr_lt4=450;
parameter upr_lt5=749;
parameter lwr_lt5=600;
parameter upr_lt6=899;
parameter lwr_lt6=750;

parameter upr_lt27=4049;
parameter lwr_lt27=3900;
parameter upr_lt28=4199;
parameter lwr_lt28=4050;
parameter upr_lt29=4349;
parameter lwr_lt29=4200;
parameter upr_lt30=4392;
parameter lwr_lt30=4350;

// These parameters define the upper and lower limits of the segments of the
// Group-test Hit list.

initial
begin
  pos1=upr_lt1;
  pos2=upr_lt2;
  pos3=upr_lt3;
  pos4=upr_lt4;
  pos5=upr_lt5;
  pos6=upr_lt6;

  pos27=upr_lt27;
  pos28=upr_lt28;
  pos29=upr_lt29;
  pos30=upr_lt30;

  decrmt_enbl1=1;
  decrmt_enbl2=1;
  decrmt_enbl3=1;
  decrmt_enbl4=1;
  decrmt_enbl5=1;
  decrmt_enbl6=1;
  decrmt_enbl7=1;

  decrmt_enbl27=1;
  decrmt_enbl28=1;
  decrmt_enbl29=1;
  decrmt_enbl30=1;

  c1=0;
  c2=0;
  c3=0;
  c4=0;
  c5=0;
  c6=0;

  c27=0;
  c28=0;
  c29=0;
  c30=0;

  mem_access=1;
end

initial $readmemh("Fanout.dat", Fan_out_hdr_tbl);
// The file Fanout.dat contains the vectors for the start of the fan-out lists for
// every gate in the circuit being simulated.

initial $readmemh("Fansize.dat", Fan_out_size_tbl);
// The file Fansize.dat specifies the size of the fan-out list for each gate being
// simulated.

initial forever
begin
  @(Reset_ctr)
  if (Reset_ctr)
  begin
    Num_hits=0;
    Hit_lst_buffr=Grr_hit_list;
    Tst_or_bit=|Grr_hit_list;
    $display("OR Check=%b", Tst_or_bit);
    End_scan_flag=0;
    Hit_fnd_flag=0;
    Hit_fnd_ORed_flg=1;

    pos1=upr_lt1;
    pos2=upr_lt2;
    pos3=upr_lt3;
    pos4=upr_lt4;
    pos5=upr_lt5;
    pos6=upr_lt6;

    pos27=upr_lt27;
    pos28=upr_lt28;
    pos29=upr_lt29;
    pos30=upr_lt30;

    decrmt_enbl1=1;
    decrmt_enbl2=1;
    decrmt_enbl3=1;
    decrmt_enbl4=1;
    decrmt_enbl5=1;
    decrmt_enbl6=1;

    decrmt_enbl27=1;
    decrmt_enbl28=1;
    decrmt_enbl29=1;
    decrmt_enbl30=1;

    c1=0;
    c2=0;
    c3=0;
    c4=0;
    c5=0;
    c6=0;

    c27=0;
    c28=0;
    c29=0;
    c30=0;

    mem_access=1;

    $display("Initialisation seq executed");
    Start_time=$time;
  end
end

always @(posedge Decrmt_enbl)
begin
  Mpl_scan_enbl=1;
end

always @(posedge Rset_hit_fnd_flg)
begin
  Hit_fnd_flag=0;
  mem_access=1;
end

always @(negedge Clk)
begin
  if (!End_scan_flag)
  begin
    Hit_fnd_ORed_flg=|Hit_lst_buffr;
    if (!Hit_fnd_ORed_flg)
    begin
      End_scan_flag=1;
      Mpl_scan_enbl=0;
    end
  end

  if ((Mpl_scan_enbl) && (Hit_fnd_ORed_flg))
  begin
    if (decrmt_enbl1)
    begin
      if (Hit_lst_buffr[pos1]==1)
      begin
        Hit_lst_buffr[pos1]=0;
        decrmt_enbl1=0;
        if (!mem_access)
        begin
          // Another register holds the fan-out memory: record a clash.
          c1=c1+1;
          $display("Clash1 c1=%d", c1);
        end

        wait (mem_access);
        mem_access=0;
        Num_hits=Num_hits+1;

        Fan_out_size_reg=Fan_out_size_tbl[pos1];
        Fan_out_src_reg=Fan_out_hdr_tbl[pos1];
        Hit_fnd_flag=1;
        Hit_lst_buffr[pos1]=0;

        if (pos1>lwr_lt1)
        begin
          pos1=pos1-1;
          decrmt_enbl1=1;
        end
      end
      else
      begin
        if (pos1>lwr_lt1)
        begin
          pos1=pos1-1;
        end
        else
          decrmt_enbl1=0;
      end
    end

    if (decrmt_enbl30)
    begin
      if (Hit_lst_buffr[pos30]==1)
      begin
        Hit_lst_buffr[pos30]=0;
        decrmt_enbl30=0;
        if (!mem_access)
        begin
          c30=c30+1;
          $display("Clash30 c30=%d", c30);
        end

        wait (mem_access);
        mem_access=0;
        Num_hits=Num_hits+1;

        Fan_out_size_reg=Fan_out_size_tbl[pos30];
        Fan_out_src_reg=Fan_out_hdr_tbl[pos30];
        Hit_fnd_flag=1;
        Hit_lst_buffr[pos30]=0;

        if (pos30>lwr_lt30)
        begin
          pos30=pos30-1;
          decrmt_enbl30=1;
        end
      end
      else
      begin
        if (pos30>lwr_lt30)
        begin
          pos30=pos30-1;
        end
        else
          decrmt_enbl30=0;
      end
    end
  end
end

always @(posedge End_scan_flag)
begin
  Finish_time=$time;
end
endmodule
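
To make the queuing behaviour of the Multiple Scan Mode easier to follow, here is a simplified behavioural sketch. It is not a translation of the listing. It assumes equal segments, a fan-out update that takes exactly one cycle, first-come first-served granting (the text says only that permission is granted arbitrarily), and no early termination; all names are hypothetical.

```python
def multi_scan(hit_list, n_regs):
    """Scan n_regs equal segments in parallel; return (serviced hits, cycles, clashes)."""
    seg = len(hit_list) // n_regs              # assumes the length divides evenly
    low = [r * seg for r in range(n_regs)]     # lower limit of each segment
    pos = [lo + seg - 1 for lo in low]         # each register starts at its segment top
    done = [False] * n_regs
    queue = []                                 # registers waiting for the fan-out memory
    serviced = []
    cycles = clashes = 0

    def advance(r):
        if pos[r] > low[r]:
            pos[r] -= 1
        else:
            done[r] = True

    while not all(done):
        cycles += 1
        for r in range(n_regs):
            if done[r] or r in queue:
                continue                       # finished, or halted awaiting permission
            if hit_list[pos[r]]:
                if queue:
                    clashes += 1               # memory already requested by another register
                queue.append(r)
            else:
                advance(r)
        if queue:                              # grant one access per cycle
            r = queue.pop(0)
            serviced.append(pos[r])
            advance(r)
    return serviced, cycles, clashes

if __name__ == "__main__":
    print(multi_scan([0, 0, 0, 1, 0, 0, 0, 1], 2))
```

In the example both registers find a hit on the first inspection, so one of them is counted as a clash and waits a cycle before its fan-out list is serviced.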



Fan-out Generator module

Description: When a hit has been detected in the Group-test Hit list, the address within the scan
register selects a vector (from the Fan-out hdr table) which locates the start of a fan-out list for
the current active gate. The address register of this module is loaded with the address of the
header of the fan-out list. The size of this fan-out list and the updated signal value to be
transmitted are also conveyed to the module. The module proceeds to effect all changes in the
fan-out lists.



module Fan^out^gen (Fan_out.load, Fan_out_gen_f Ig, Reset^gen, Update_val_in, 

Clock, Update_val_out , Fan„out„size_reg , 
Fan_out_adr_reg,Out_adr_reg) ; 

//The address in Fan_out_vector_tbl of the header of the Fan-out list and the 
number of fan-out elements, are contained in Fan_out_adr_irag and Pan_out_sxz«„r03 
respectively. These are loaded on the positive edge of FM_out_load. On the 
successive negative edge(s) of Clock the address of a fan-out wire is generated m 
Oat aar^reg. The end of a fan-out list is indicated when Fan_out_goii_f Ig is set. 
Thii flag is cleared by the positive edge of R«set_g«n. The signal value to be 
conveyed to the fan-out list is transferred to and transmitted by the module in 
tlpdat«_val_in and UJpdata^val.out , respectively.// 

parameter Vectr_tbl_wrd_size = 13; 
parameter Vectr_tbl_si2e = 16383; 
parameter Inp_val_wdth=2 ; 
parameter Max_f cu:i_out=7 ; 
parameter Vectr_tbl_adr_size=13 ; 

input Fan_out_load , Reset_gen , Clock; 
input C lnp_val_wdth : 0 ] Update_val_in; 
input [ Max_f an_out : 0 1 Fan_out_s ize_reg ; 
input [ Vectr_tbl_adr_si2e : 0 1 Fan_out_adr_reg ; 

output Fan_ou t_gen_f 1 g ; 
reg Fan_out_gen_f Ig; 

output [ Inp_val_wdth : 0 3 Update_val_out ; 
reg [ Inp_val_wdth : 0 ] Update_val_out ; 

output [ Vectr_tbl_wrd_si2e : 0 1 Out_adr_reg ; 
reg [Vectr_tbl_wrd_size: 0] Out_adr_reg; 

reglVectr_tbl_wrd_size:01 Fan_out_vector_tbl ( 0 : Vectr_tbl_size] ; 
reg [Vectr_tbl_wrd_size : 0) List_pos; 
reg [ Max_f an_out : 0 3 Counter ; 



initial $readmemh("Fanvcr.dat", Fan_out_vector_tbl);

// Fanvcr.dat contains the vectors of the signals in the fan-out lists for every
// gate.



initial forever
begin
  @(Reset_gen)
  if (Reset_gen)
  begin
    Fan_out_gen_flg=0;
  end
end

always @(posedge Fan_out_load)
begin
  if (!Reset_gen)
  begin
    Counter=Fan_out_size_reg;
    List_pos=Fan_out_adr_reg;
    Update_val_out=Update_val_in;
    Fan_out_gen_flg=1;
  end
end

always @(negedge Clock)
begin
  if (!Reset_gen && Fan_out_gen_flg)
  begin
    if (Counter>0)
    begin
      Out_adr_reg=Fan_out_vector_tbl[List_pos];
      List_pos=List_pos+1;
      Counter=Counter-1;
    end
    else
      Fan_out_gen_flg=0;
  end
end
endmodule






Input-value Bank 

Description: The bank contains the current values of all the signals in the circuit. Each location
in the bank corresponds to a wire. Since a word at any location is 3 bits wide, up to 8-valued
logic can be simulated (this can be augmented by increasing the word width). The current value
of any wire is shifted from this bank into Array 1b when time is incremented. This is done in
parallel. Only wire values that have changed in the current time interval are updated.

module Input_val_bank (Inp_val_reg, Adr_reg, Clock, Shft_enbl, Wrt_enbl,
                       Out_buffr_reg);

// Inp_val_reg contains the new value of a signal (i.e. word) in Inp_val_ary. The
// location of the wire is specified in Adr_reg and the write operation takes effect
// on the negative edge of Clock if Wrt_enbl is asserted. If Shft_enbl is asserted
// then the right-most bit of every location is shifted into the 1-bit column
// register Out_buffr_reg on the positive edge of Clock. All shifted bits are also
// written into the right-most bit of Inp_val_ary (i.e. a rotation); thus all current
// values have been retained after the shifting out process.

parameter Inp_val_wdth=2;
parameter Adr_reg_bits=13;
parameter Inp_bnk_size=16383;
parameter Lsr7552_Inp_bnk_size=8784;

input Clock, Shft_enbl, Wrt_enbl;
input [Inp_val_wdth:0] Inp_val_reg;
input [Adr_reg_bits:0] Adr_reg;

output [Inp_bnk_size:0] Out_buffr_reg;
reg [Inp_bnk_size:0] Out_buffr_reg;

reg [Inp_val_wdth:0] Inp_val_ary [0:Inp_bnk_size];

reg [Inp_val_wdth:0] Temp_reg;
reg Temp_bit;

integer Inp_ary_indx, i;

initial $readmemb("Inpval.dat", Inp_val_ary);

// Inpval.dat is the file which initialises the current input values of all gates
// in the simulated circuit. All values are assigned 'Unknown' logic values except
// those primary inputs which are assigned logic '0' or '1'.

always @(posedge Clock)
begin
  if (Shft_enbl)
  begin
    for (Inp_ary_indx=0; Inp_ary_indx<=Lsr7552_Inp_bnk_size;
         Inp_ary_indx=Inp_ary_indx+1)
    begin
      Temp_reg=Inp_val_ary[Inp_ary_indx];
      Temp_bit=Temp_reg[0];
      Out_buffr_reg[Inp_ary_indx]=Temp_bit;
      Temp_reg[1:0]=Temp_reg[Inp_val_wdth:1];
      Temp_reg[Inp_val_wdth]=Temp_bit;
      Inp_val_ary[Inp_ary_indx]=Temp_reg;
    end
    $display("(shft) time=%d",$time);
  end
  else
  if (Wrt_enbl)
  begin
    Inp_val_ary[Adr_reg]=Inp_val_reg;
  end
end
endmodule

The Sequence Logic of the APPLES Processor 



parameter Nibl=3;
parameter Ary_la_wdth=7;
parameter Ary_lb_adr_reg_wdth=13;
parameter Ary_la_size=16383;
parameter Ary_lb_size=16383;
parameter Eval_ptrn_tbl_size=63;
parameter Eval_ptrn_vctr_tbl_size=31;
parameter Num_tst_wdth=7;
parameter Num_tst_ptrn_tbl_size=31;
parameter Gate_maskla_tbl_size=31;
parameter Gate_inptla_tbl_size=31;
parameter Trr_ptrn_tbl_size=31;
parameter Grr_ptrn_tbl_size=31;
parameter Out_val_tbl_size=31;

parameter Wlr_wrdsize=31;
parameter Trr_wdth_spec=2;
parameter Trr_word_size=7;
parameter Grr_mem_size=8191;
parameter Grr_wdth_spec=2;
parameter Grr_word_size=7;
parameter Iu_word_size=7;
parameter Iu_wdth_spec=2;
parameter Vectr_tbl_adr_reg=13;
parameter Max_fan_out=7;
parameter Inp_val_wdth=2;
parameter Vectr_tbl_adr_size=16383;

parameter Index_reg_wdth=7;

parameter Num_tst_seq=12; //No of gates X No Transitions
parameter Num_tst_cnt_wdth=3;
parameter Init_shft_val=3;
parameter Shft_cnt_wdth=3;

wire Clock;

wire [Ary_la_size:0] Wrd_ln_activ_lst, Trr_bnk_inp_reg;
wire [Ary_lb_size:0] Inval_unit_out_reg;
wire [Grr_mem_size:0] Grr_bnk_inp_reg, Grr_bnk_hit_lst;
wire [Max_fan_out:0] Mrr_unit_fan_out_size_reg;
wire [Vectr_tbl_adr_reg:0] Mrr_unit_fan_out_src_reg;
wire [Inp_val_wdth:0] Fo_gen_unit_val_out;
wire [Vectr_tbl_adr_size:0] Fo_gen_unit_out_adr_reg;



reg Tst_seq_strt;

reg e0, e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12, e13, e14,
    e15, e16, e16a, e16b, e17, e18, e19, e20, e21, e22, e23, e24, e25, e26, e27, e28, e29,
    Deact_srchla, Gate_eval_init_proc;

reg [Index_reg_wdth:0] Ept_i, Epvt_i, Ntpt_i, Gmlat_i, Gilat_i,
                       Tpt_i, Grit_i, Grmt_i, Ovt_i;

reg [Wlr_wrdsize:0] Eval_ptrn_tbl [0:Eval_ptrn_tbl_size];
reg [Wlr_wrdsize:0] Eval_ptrn_vctr_tbl [0:Eval_ptrn_vctr_tbl_size];
reg [Num_tst_wdth:0] Num_tst_ptrn_tbl [0:Num_tst_ptrn_tbl_size];
reg [Ary_la_wdth:0] Gate_maskla_tbl [0:Gate_maskla_tbl_size];
reg [Ary_la_wdth:0] Gate_inptla_tbl [0:Gate_inptla_tbl_size];
reg [Trr_word_size:0] Trr_ptrn_tbl [0:Trr_ptrn_tbl_size];
reg [Grr_word_size:0] Grr_inpt_tbl [0:Grr_ptrn_tbl_size];






reg [Grr_word_size:0] Grr_mask_tbl [0:Grr_ptrn_tbl_size];
reg [Inp_val_wdth:0] Out_val_tbl [0:Out_val_tbl_size];

reg [Grr_word_size:0] Grr_bnk_search_reg, Grr_bnk_mask_reg;
reg [Grr_wdth_spec:0] Grr_bnk_wrt_pos;
reg [Trr_wdth_spec:0] Trr_bnk_wrt_pos;
reg [Trr_word_size:0] Trr_rslt_act_reg, Trr_rslt_act_and_0;
reg [Iu_word_size:0] Inval_unit_adr_reg;
reg [Iu_wdth_spec:0] Fo_gen_unit_val_in, Inval_unit_in_reg;

reg Search_ary_la, Write_enbl_la, Ary_lb_wrt_enbl, Wlr_bnk_search_enbl, Shft_ary_lb,
    Ary_lb_rd_enbl, Trr_bnk_wrt_enbl, Trr_bnk_comb_enbl, Trr_bnk_rset,
    Grr_bnk_search_enbl, Grr_bnk_wrt_enbl, Mrr_unit_rset, Mrr_unit_decrmt_enbl,
    Mrr_unit_rset_hit_fnd_flg, Fo_gen_unit_load, Fo_gen_unit_rset,
    Inval_unit_shft_enbl, Inval_unit_wrt_enbl;

reg [Ary_la_wdth:0] Inp_regla, Mask_regla, Adr_regla;
reg [Wlr_wrdsize:0] Inp_reg_lb, Search_reg_lb, Mask_reg_lb;
reg [Ary_lb_adr_reg_wdth:0] Adr_reg_lb;
reg [Num_tst_cnt_wdth:0] Num_tst_cnt;
reg [Shft_cnt_wdth:0] Shft_cnt;



Ary_la Gate_id_bnk (Inp_regla, Mask_regla, Adr_regla, Clock,
                    Search_ary_la, Write_enbl_la, Wrd_ln_activ_lst);

Ary_lb Wrd_ln_reg_bnk (Search_reg_lb, Mask_reg_lb, Adr_reg_lb,
                       Inp_reg_lb, Out_reg_lb, Trr_bnk_inp_reg, Shft_ary_lb,
                       Wlr_bnk_search_enbl, Ary_lb_wrt_enbl, Ary_lb_rd_enbl,
                       Clock, Inval_unit_out_reg, Wrd_ln_activ_lst);

Tst_rslt_reg_bank Trr_bnk (Trr_bnk_inp_reg, Trr_bnk_wrt_enbl, Trr_bnk_comb_enbl,
                           Clock, Grr_bnk_inp_reg, Trr_rslt_act_reg,
                           Trr_bnk_wrt_pos, Trr_bnk_rset);

Grp_rslt_reg_bank Grr_bnk (Grr_bnk_inp_reg, Grr_bnk_mask_reg,
                           Grr_bnk_search_reg, Clock, Grr_bnk_search_enbl,
                           Grr_bnk_wrt_enbl, Grr_bnk_wrt_pos, Grr_bnk_hit_lst);

Multiple_res_res Mrr_unit (Grr_bnk_hit_lst, Clock, Mrr_unit_rset,
                           Mrr_unit_end_scan_flg, Mrr_unit_decrmt_enbl,
                           Mrr_unit_fan_out_src_reg,
                           Mrr_unit_fan_out_size_reg,
                           Mrr_unit_rset_hit_fnd_flg,
                           Mrr_unit_hit_fnd_flag);

Fan_out_gen Fo_gen_unit (Fo_gen_unit_load, Fo_gen_unit_flg, Fo_gen_unit_rset,
                         Fo_gen_unit_val_in, Clock, Fo_gen_unit_val_out,
                         Mrr_unit_fan_out_size_reg, Mrr_unit_fan_out_src_reg,
                         Fo_gen_unit_out_adr_reg);

Input_val_bank Inval_unit (Fo_gen_unit_val_out, Fo_gen_unit_out_adr_reg, Clock,
                           Inval_unit_shft_enbl, Inval_unit_wrt_enbl,
                           Inval_unit_out_reg);

Ck_gen Clk_unit (Clock);



integer i, Tst_num, iter_cnt;

initial
begin
  $display("Initialisation commencing.");
  $readmemb("Ep_tbl.dat", Eval_ptrn_tbl);
  $display("Ep_tbl.dat loaded.");
  $readmemh("Epv_tbl.dat", Eval_ptrn_vctr_tbl);
  $display("Epv_tbl.dat loaded.");
  $readmemh("Ntp_tbl.dat", Num_tst_ptrn_tbl);
  $display("Ntp_tbl.dat loaded.");
  $readmemb("Gila_tbl.dat", Gate_inptla_tbl);
  $display("Gila_tbl.dat loaded.");
  $readmemb("Gmla_tbl.dat", Gate_maskla_tbl);
  $display("Gmla_tbl.dat loaded.");
  $readmemb("Tp_tbl.dat", Trr_ptrn_tbl);
  $display("Tp_tbl.dat loaded.");
  $readmemb("Gi_tbl.dat", Grr_inpt_tbl);
  $display("Gi_tbl.dat loaded.");
  $display("Gi_tbl.dat loaded.");
  $readmemb("Gm_tbl.dat", Grr_mask_tbl);
  $display("Gm_tbl.dat loaded.");
  $readmemb("Ov_tbl.dat", Out_val_tbl);
  $display("Ov_tbl.dat loaded.");
  $display("Table initialisation sequence completed");

  Gate_eval_init_proc=1;
  iter_cnt=0;
  Num_tst_cnt=Num_tst_seq;
  Inval_unit_shft_enbl=0;

  Ept_i=8'h00; Epvt_i=8'h00; Ntpt_i=8'h00;
  Gmlat_i=8'h00; Gilat_i=8'h00; Tpt_i=8'h00;
  Grit_i=8'h00; Grmt_i=8'h00; Ovt_i=8'h00;
end

always @(negedge Clock)
if (Gate_eval_init_proc)
begin
  $display("Gate_eval_init_proc @ time=%d", $time);
  iter_cnt=iter_cnt+1;
  $display("Iteration count=%d", iter_cnt);
  Gate_eval_init_proc=0;
  Deact_srchla=0;

  e0=0; e1=0; e2=0; e3=0; e4=0; e5=0; e6=0;
  e7=0; e8=0; e9=0; e10=0; e11=0; e12=0; e13=0;
  e14=0; e15=0; e16=0; e16a=0; e16b=0; e17=0;
  e18=0; e19=0; e20=0; e21=0; e22=0;

  Inp_regla=Gate_inptla_tbl[Gilat_i];
  Mask_regla=Gate_maskla_tbl[Gmlat_i];
  Tst_num=Num_tst_ptrn_tbl[Ntpt_i];
  Ept_i=Eval_ptrn_vctr_tbl[Epvt_i];
  Mrr_unit_decrmt_enbl=0;
  Tst_seq_strt=1;
  Wlr_bnk_search_enbl=0;
  Inval_unit_wrt_enbl=0;
end

always @(posedge Clock)
begin
  if (Tst_seq_strt)
  begin
    Trr_bnk_rset=1;
    Search_ary_la=1;
    e0=1;
    Tst_seq_strt=0;
  end
end

always @(negedge Clock)
begin
  if (e0)
  begin
    e0=0;
    Deact_srchla=1;
  end
end

always @(posedge Clock)
begin
  if (Deact_srchla)
  begin
    Trr_bnk_rset=0;
    Deact_srchla=0;
    Search_ary_la=0;
    e1=1;
    i=Trr_word_size;
  end
end

always @(negedge Clock)
begin
  if (e1)
  begin
    e1=0;
    e2=1;
  end
end



always @(posedge Clock)
begin
  if (e2)
  begin
    Wlr_bnk_search_enbl=1;
    Search_reg_lb=Eval_ptrn_tbl[Ept_i];
    Mask_reg_lb=Eval_ptrn_tbl[Ept_i+1];
    e2=0;
    e3=1;
  end
end

always @(negedge Clock)
begin
  if (e3)
  begin
    e3=0;
    e4=1;
  end
end

always @(posedge Clock)
begin
  if (e4)
  begin
    Trr_bnk_wrt_enbl=1;
    Trr_bnk_wrt_pos=i;
    Wlr_bnk_search_enbl=0;
    e4=0;
    e5=1;
  end
end

always @(negedge Clock)
begin
  if (e5)
  begin
    e5=0;
    e6=1;
  end
end



always @(posedge Clock)
begin
  if (e6)
  begin
    Tst_num=Tst_num-1;
    i=i-1;
    e6=0;

    if (Tst_num>0)
    begin
      e1=1;
      Ept_i=Ept_i+2;
      $display("Ept_i (updated)=%d", Ept_i);
      Trr_bnk_wrt_enbl=0;
    end
    else
    begin
      Trr_bnk_wrt_enbl=0;
      i=Trr_word_size;
      Trr_rslt_act_reg=Trr_ptrn_tbl[Tpt_i];
      Tst_num=Num_tst_ptrn_tbl[Ntpt_i];
      e7=1;
    end
  end
end

always @(negedge Clock)
begin
  if (e7)
  begin
    e7=0;
    e8=1;
  end
end



always @(posedge Clock)
begin
  if (e8)
  begin
    Trr_bnk_comb_enbl=1;
    Trr_bnk_wrt_pos=i;
    e8=0;
    e9=1;
    $display("Commencement of TRR tests for Gate type=%b", Inp_regla, "at time=%d", $time);
  end
end

always @(negedge Clock)
begin
  if (e9)
  begin
    e9=0;
    e10=1;
  end
end

always @(posedge Clock)
begin
  if (e10)
  begin
    Trr_bnk_comb_enbl=0;
    Grr_bnk_wrt_enbl=1;
    Grr_bnk_wrt_pos=i;
    e10=0;
    e11=1;
  end
end

always @(negedge Clock)
begin
  if (e11)
  begin
    e11=0;
    e12=1;
  end
end

always @(posedge Clock)
begin
  if (e12)
  begin
    Tst_num=Tst_num-1;
    i=i-1;
    e12=0;

    if (Tst_num>0)
    begin
      e9=1;
      Trr_bnk_comb_enbl=1;
      Trr_bnk_wrt_pos=i;
      Grr_bnk_wrt_enbl=0;
    end
    else
    begin
      e13=1;
      Grr_bnk_wrt_enbl=0;
    end
  end
end

always @(negedge Clock)
begin
  if (e13)
  begin
    e13=0;
    e14=1;
    $display("Termination of Trr tests for Gate type=%b", Inp_regla, "at time=%d", $time);
  end
end



always @(posedge Clock)
begin
  if (e14)
  begin
    Grr_bnk_search_reg=Grr_inpt_tbl[Grit_i];
    Grr_bnk_mask_reg=Grr_mask_tbl[Grmt_i];
    Grr_bnk_search_enbl=1;
    Fo_gen_unit_rset=1;
    e14=0;
    e15=1;
  end
end

always @(negedge Clock)
begin
  if (e15)
  begin
    e15=0;
    e16=1;
  end
end

always @(posedge Clock)
begin
  if (e16)
  begin
    Mrr_unit_rset=1;
    e16=0;
    e16a=1;
  end
end

always @(negedge Clock)
begin
  if (e16a)
  begin
    Mrr_unit_rset=0;
    e16a=0;
    e16b=1;
  end
end

// Propagate values to gates affected in fan-out lists.

always @(posedge Clock)
begin
  if (e16b)
  begin
    Grr_bnk_search_enbl=0;
    Mrr_unit_decrmt_enbl=1;
    Fo_gen_unit_rset=0;
    Fo_gen_unit_val_in=Out_val_tbl[Ovt_i];
    e16b=0;
    e17=1;
    $display("Start of fanout list at time=%d", $time);
  end
end

always @(negedge Clock)
begin
  if (e17)
  begin
    Fo_gen_unit_load=0;
    e17=0;
    e18=1;
  end
end


always @(posedge Clock)
begin
  if (e18)
  begin
    if (Mrr_unit_hit_fnd_flag)
    begin
      Fo_gen_unit_load=1;
      e18=0;
      e19=1;
    end
    else
    if ((!Mrr_unit_hit_fnd_flag) & (Mrr_unit_end_scan_flg))
    begin
      e18=0;
      e22=1;
      Mrr_unit_decrmt_enbl=0;
    end
  end
end

always @(negedge Clock)
begin
  if (e19)
  begin
    Fo_gen_unit_load=0;
    Inval_unit_wrt_enbl=1;
    Mrr_unit_rset_hit_fnd_flg=0;
    e19=0;
    e20=1;
  end
end

always @(posedge Clock)
begin
  if (e20)
  begin
    if (!Fo_gen_unit_flg)
    begin
      if (!Mrr_unit_end_scan_flg)
      begin
        Mrr_unit_rset_hit_fnd_flg=1;
        Inval_unit_wrt_enbl=0;
        e20=0;
        e21=1;
      end
      else
      begin
        Inval_unit_wrt_enbl=0;
        e20=0;
        e22=1;
      end
    end
  end
end

always @(negedge Clock)
begin
  if (e21)
  begin
    e18=1;
    e21=0;
  end
end



always @(negedge Clock)
begin
  if (e22)
  begin
    e22=0;
    e23=1;

    Epvt_i=Epvt_i+1; Ntpt_i=Ntpt_i+1;
    Gmlat_i=Gmlat_i+1; Gilat_i=Gilat_i+1;
    Tpt_i=Tpt_i+1;
    Grit_i=Grit_i+1; Grmt_i=Grmt_i+1;
    Ovt_i=Ovt_i+1;

    $display("Termination of Fan out update, time=%d", $time);
  end
end

always @(posedge Clock)
begin
  if (e23)
  begin
    e23=0;
    Num_tst_cnt=Num_tst_cnt-1;

    if (Num_tst_cnt==0)
    begin
      e24=1;
    end
    else
      Gate_eval_init_proc=1;
  end
end



always @(negedge Clock)
begin
  if (e24)
  begin
    $display("E24 attained, End of fanout update.");
    $display("
    Inval_unit_shft_enbl=1;
    Shft_cnt=Init_shft_val;
    e24=0;
    e25=1;
  end
end

// Input_val_bank is +ve edge triggered. Thus next block is -ve edge.

always @(posedge Clock)
begin
  if (e25)
  begin
    $display("E25 attained");
    Shft_ary_lb=1;
    e25=0;
    e26=1;
  end
end

always @(negedge Clock)
begin
  if (e26)
  begin
    $display("E26 attained");
    Shft_cnt=Shft_cnt-1;
    if (Shft_cnt==0)
    begin
      e26=0;
      Inval_unit_shft_enbl=0;
      e27=1;
    end
  end
end



always @(posedge Clock)
begin
  if (e27)
  begin
    Shft_ary_lb=0;
    e27=0;
    e28=1;
  end
end

always @(negedge Clock)
begin
  if (e28)
  begin
    e28=0;
    e29=1;
  end
end

always @(posedge Clock)
begin
  if (e29)
  begin
    Gate_eval_init_proc=1;
    Num_tst_cnt=Num_tst_seq;

    Ept_i=8'h00; Epvt_i=8'h00; Ntpt_i=8'h00;
    Gmlat_i=8'h00; Gilat_i=8'h00; Tpt_i=8'h00;
    Grit_i=8'h00; Grmt_i=8'h00; Ovt_i=8'h00;
    e29=0;
  end
end



endmodule

The APPLES architecture is designed to provide a fast and flexible mechanism for
logic simulation. The technique of applying test patterns to an associative memory
culminates in a fixed time gate processing and a flexible delay model. Multiple
scan registers provide an effective way of parallelising the fan-out updating
procedure. This mechanism eliminates the need for conventional parallel
techniques such as load balancing and deadlock avoidance or recovery.
Consequently, parallel overheads are reduced. As more scan registers are
introduced, the gate evaluation rate increases, ultimately being limited by the
average fan-out list size per gate and consequently the memory bandwidth of the
fan-out list memory.

Referring to Fig. 8, there is illustrated an array, indicated generally by the reference
numeral 20, comprising a plurality of cells 21, each of which comprises an APPLES
processor as described above. A synchronisation logic control 22 is provided. The
circuit that is to be simulated is split up among the APPLES processors. Gate
evaluations are carried out independently in each processor or cell 21. Each cell
21 is provided with a local input value register bank and a foreign input value
register bank to allow interconnection, which is done through an interconnecting
network 23 incorporating the synchronisation logic 22. Connections between the
synchronisation logic circuit 22 (which is, strictly speaking, the main synchronisation
logic circuit) and each of the cells 21 are not shown.

After all gate evaluations for all gate types and the corresponding updates have
occurred on a given processor forming a cell 21, the processor must wait for all
other processors to reach the same state. When all processors reach this state
then the respective input value register banks can be shifted into the respective
array and associative register 1b and evaluation of the next time unit can occur.
Thus, to achieve implementation, a suitable interconnecting network must be
designed and an interface to the APPLES processor constructed.
A synchronisation method must exist to determine when evaluation of the next
time unit should proceed. A system to split the hit list information amongst the
processors is required in order to initialise the system.

The array of processors is implemented as a torus (equivalent to a 2D mesh with
wrap-around) as shown in Fig. 8. The inclusion of wrap-around connections
reduces the network diameter, increasing the network speed. It also means that
each processor can be identical without wasted hardware at the edges of the array.
It does, however, require a more complicated routing mechanism. No set size was
used for the array; instead the size was used as a criterion which was varied during
simulations. This criterion was specified by a command line parameter to the
Verilog compiler. These command line parameters are covered in detail in the next
chapter.

Each cell is connected to its four neighbouring cells via serial connections.
Obviously parallel connections would be faster. However, a Virtex FPGA was used
and it has a limited number of pins. It may happen that not all of these pins are 
available to a particular design due to the FPGA architecture. Pins are therefore a 
precious resource. Since each FPGA would require eight parallel connections (an 
input and an output connection on each of the four edges) this would require a 
large number of pins. If at a later stage it is discovered that there are spare pins 
and a parallel network is justified then the design could be altered. In this design 
each cell has a serial input and a serial output on each of its four edges. These 
serial connections each consist of a data line and two control lines. These serial 
connections will therefore require 12 pins on each Virtex FPGA. Each cell is also 
connected to the array's synchronisation logic. 

In order to design the network, knowledge of the information that the network must
carry is required. The network is required in order to pass fan out updates between
processors. These updates can be passed as messages. Each message is an
update and consists of a destination address and an update value. A single Virtex
FPGA was used to implement an APPLES processor capable of simulating a circuit
with approximately 256 gates. This figure is somewhat arbitrary and further design
work will reveal the true value required. Given a restraint of 256 gates per
processor, approximately 64 processors would be required to simulate a reasonably
complex circuit. This corresponds to an 8 x 8 array. Each processor will need to
be able to send updates to any other processor, updating any one of their 512 gate
inputs. This implies an address space of six bits to identify the processor and an
address space of nine bits to identify the wire. Each update sent also requires an
update value. These are three bits wide (enabling support for eight-state logic).
Therefore messages sent from processor to processor will need to be eighteen bits
wide. These figures are arbitrary but are a useful starting point.
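
The eighteen-bit message layout described above can be sketched as follows. This is
only an illustration: the module name Msg_codec_sketch, the port names and the field
order (processor address in the top bits, then wire address, then value) are
assumptions; the text fixes only the field widths of six, nine and three bits.

```verilog
// Hypothetical sketch: packs and unpacks an 18-bit update message.
// Field order is an assumption; only the field widths come from the text.
module Msg_codec_sketch (Proc_adr_in, Wire_adr_in, Val_in, Msg_out,
                         Msg_in, Proc_adr_out, Wire_adr_out, Val_out);

parameter Proc_adr_wdth = 6;  // identifies one of 64 processors
parameter Wire_adr_wdth = 9;  // identifies one of 512 gate inputs
parameter Val_wdth      = 3;  // eight-state logic value
parameter Msg_wdth      = Proc_adr_wdth + Wire_adr_wdth + Val_wdth;  // 18

input  [Proc_adr_wdth-1:0] Proc_adr_in;
input  [Wire_adr_wdth-1:0] Wire_adr_in;
input  [Val_wdth-1:0]      Val_in;
output [Msg_wdth-1:0]      Msg_out;

input  [Msg_wdth-1:0]      Msg_in;
output [Proc_adr_wdth-1:0] Proc_adr_out;
output [Wire_adr_wdth-1:0] Wire_adr_out;
output [Val_wdth-1:0]      Val_out;

// Pack: concatenate the three fields into one message word.
assign Msg_out = {Proc_adr_in, Wire_adr_in, Val_in};

// Unpack: split a received message word back into its three fields.
assign {Proc_adr_out, Wire_adr_out, Val_out} = Msg_in;

endmodule
```

Deriving Msg_wdth from the three field widths keeps the sketch consistent if, as the
text anticipates, the figures are revised later.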

The structure of a cell 21 is shown in Fig. 9. Each of the four edges has a
transmitter 25 and a receiver 26. These modules deal with the serial connections.
The transmitter 25 takes in an eighteen-bit entity and sends it out in a bit stream.
The receiver 26 takes in the bit stream and reconstitutes it into the original
eighteen-bit message.
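
A minimal sketch of the transmitter side is given below, assuming a simple
load-then-shift scheme. The module name Tx_sketch, the Load and Busy signals and the
most-significant-bit-first order are all assumptions; the text states only that an
eighteen-bit entity is sent as a bit stream over a data line with two control lines.

```verilog
// Hypothetical sketch of a parallel-to-serial transmitter.
// Load/Busy handshake and bit order are assumptions, not taken from the text.
module Tx_sketch (Clock, Load, Msg_in, Ser_out, Busy);

parameter Msg_wdth = 18;

input                 Clock, Load;
input  [Msg_wdth-1:0] Msg_in;
output                Ser_out, Busy;

reg [Msg_wdth-1:0] Shft_reg;
reg [4:0]          Bit_cnt;   // bits still to be sent (0 when idle)

initial Bit_cnt = 0;

assign Busy    = (Bit_cnt != 0);
assign Ser_out = Shft_reg[Msg_wdth-1];  // most significant bit goes out first

always @(posedge Clock)
begin
  if (Load && !Busy)
  begin
    Shft_reg <= Msg_in;      // capture the whole message
    Bit_cnt  <= Msg_wdth;
  end
  else if (Busy)
  begin
    Shft_reg <= Shft_reg << 1;  // present the next bit
    Bit_cnt  <= Bit_cnt - 1;
  end
end

endmodule
```

A matching receiver would shift Ser_out into a register for eighteen clock periods
while Busy is high and then present the reassembled word.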


A request scanner 27 checks every receiver 26 and the APPLES processor 30 
simultaneously to see if they have messages waiting to be routed. It assigns each 
of these sources a rotating priority and picks the source that has a message and 
the highest priority. It then passes the picked message to a request router 28. 
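
The rotating-priority selection can be sketched as below. The module name
Rot_pick_sketch, its ports and the way the priority origin Start is supplied are
assumptions made for illustration; the text does not say how the rotation is advanced.

```verilog
// Hypothetical sketch: pick one of five requesters (four receivers plus the
// local processor) using a rotating priority. All names are illustrative.
module Rot_pick_sketch (Req, Start, Grant_valid, Grant_idx);

parameter Num_src = 5;

input  [Num_src-1:0] Req;         // Req[k]=1 when source k has a message waiting
input  [2:0]         Start;       // source that currently holds top priority
output               Grant_valid; // 1 when some source was picked
output [2:0]         Grant_idx;   // index of the picked source

reg       Grant_valid;
reg [2:0] Grant_idx;
integer   k, idx;

always @(Req or Start)
begin
  Grant_valid = 0;
  Grant_idx   = 0;
  // Visit sources from lowest to highest priority; the last hit written
  // is therefore the requesting source closest to Start.
  for (k = Num_src-1; k >= 0; k = k-1)
  begin
    idx = (Start + k) % Num_src;
    if (Req[idx])
    begin
      Grant_valid = 1;
      Grant_idx   = idx;
    end
  end
end

endmodule
```

One possible rotation policy, again an assumption, is for the surrounding logic to set
Start to the source after Grant_idx once a message has been routed.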


The request router 28 passes its messages either to the APPLES processor 30 or 
to a transmitter 25. If the option chosen is a transmitter then the message will be
sent to a different cell 21. If the option chosen is the APPLES processor 30 then
the message is an update for the local processor. A synchronisation logic circuit 31
controls the cell 21 through the synchronisation logic circuit 22.

In Fig. 9 every transmitter, every receiver and the input and output ports of the 
APPLES processor have buffers connected. A command line parameter to the 
Verilog compiler specifies whether these components are to be used or removed 
from the design. One slightly different behaviour of these buffers is that they
process data in a LIFO fashion. The effect of these buffers on performance is an 
important part of the system analysis. 

The request router 28 employs one of two different routing techniques. The
technique used is determined by a command line parameter to the Verilog
simulator used to implement the invention. A comparison of the routing techniques
is important to the understanding of the invention. Both routing techniques operate
in a similar manner.

The request router 28 decodes the message. It can then determine the destination
processor. It determines all the valid options for routing the message. The
message could be routed to the local APPLES processor 30 or to one of the
transmitters 25. The message is then routed to one of the valid options.


The first routing technique only produces one valid routing option and if that route is
not blocked then the message is routed in that direction. If it is blocked then the
request router 28 attempts to route a different message. Messages are passed
from cell 21 to cell 21 until they reach their destination. Under this routing
technique a message is passed first either in the east or west direction until it is at
the correct east-west location. It is then routed in the north or south direction until
the message arrives at its destination. The net result of the message passing is
that the message travels the minimum distance. This routing strategy results in
the traffic between any two given cells 21 always following the same route through
the network. This routing strategy can be called standard routing.
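
The standard routing decision can be sketched as a small combinational module. The
module name Std_route_sketch, the direction codes, the convention that east and north
correspond to increasing x and y, and the tie-break when both ways round a ring are
equally short are all assumptions; the text specifies only the east-west-then-north-south
order and the minimum-distance property.

```verilog
// Hypothetical sketch of "standard" routing on a torus: resolve the
// east-west offset first, then the north-south offset.
module Std_route_sketch (My_x, My_y, Dst_x, Dst_y, Dir);

parameter Ary_wdth   = 8;  // cells per row (8 x 8 array in the text)
parameter Ary_hght   = 8;  // cells per column
parameter Coord_wdth = 3;

// Direction codes are illustrative only.
parameter LOCAL = 3'd0, EAST = 3'd1, WEST = 3'd2, NORTH = 3'd3, SOUTH = 3'd4;

input  [Coord_wdth-1:0] My_x, My_y, Dst_x, Dst_y;
output [2:0]            Dir;
reg    [2:0]            Dir;

integer dx, dy;

always @(My_x or My_y or Dst_x or Dst_y)
begin
  // Hops needed travelling east (increasing x) and north (increasing y),
  // counted modulo the array size because of the wrap-around links.
  dx = (Dst_x + Ary_wdth - My_x) % Ary_wdth;
  dy = (Dst_y + Ary_hght - My_y) % Ary_hght;

  if (dx != 0)
    Dir = (dx <= Ary_wdth/2) ? EAST : WEST;    // shorter way round the row
  else if (dy != 0)
    Dir = (dy <= Ary_hght/2) ? NORTH : SOUTH;  // shorter way round the column
  else
    Dir = LOCAL;                               // message has arrived
end

endmodule
```

Because the choice depends only on the current and destination co-ordinates, traffic
between two given cells always takes the same path, which matches the behaviour
described for standard routing.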

The second routing technique is more complicated. Under this strategy the request
router 28 determines all of the available directions that can be taken by the
message which will result in it travelling the shortest distance. The various options
have different priorities associated with them. This priority is based on the options
that were previously taken. This priority method helps to use the various routes
evenly and therefore efficiently. Some of the options may not be feasible as they
may be in use with previous messages. An option is chosen based on priority and
availability. The priority information is then updated. This routing strategy can be
called advanced routing.

For both routing techniques, when all valid paths are blocked and the request
router 28 is unable to route its message then it simply drops the message. This is
an important aspect to the manner in which the request scanner 27 and request
router 28 work together. The request scanner 27 takes a message from one of its
sources. It does not inform the source that it is attempting to route this message.
The source maintains the message at its output. If the request router 28
successfully routes the message then it tells request scanner 27 that it has done so
and the request scanner 27 informs the source. This way the request router 28 is
not committed to routing a particular message. The request router 28 therefore is
always free to attempt to route messages.

The network interface 42 shares access to the input value register bank 2
between the local processor and the network. The local processor gets priority.
This module decodes the message and updates the appropriate location in the
input value register bank 2.

The network interface 42 is connected between the fan out generator 43 and the
input value register bank 2. It can therefore pass fan out updates from the
processor to the network when appropriate or simply pass them to the input value
register bank 2. It can also pass fan out updates from the network to the input
value register bank 2. Some changes were required in the fan out generator 43 to
accommodate the network interface 42.


When each processor in the array has processed the fan out list for each of its
active gates and all updates have reached their destination then each processor
can shift its input value register bank 2 into its array 1b and proceed with evaluation
of the next time unit. In order to achieve this some synchronisation logic, between
the cells 21, is required. The implementation for this requires each processor to
report to its cell 21 when it has completed sending updates. Each cell 21 also
monitors the network activity and reports back to the array stating whether there is
network activity or processor activity. The array therefore knows when all
processors are finished updating and when the network is empty. At such a time
the array reports back to the cells 21. Then the cells 21 tell the processors to
proceed with the next time unit in the delay model. The implementation of this
system required minor changes in the sequence logic of the APPLES processor.

The network is not used to communicate this synchronisation information. Instead
dedicated wires are provided. Each cell 21 has a finished input wire and a finished
output wire. The cell 21 holds the finished output wire high when its processor has
finished and no network activity is occurring around the cell 21. The finished input
wire is controlled by the array synchronisation logic. The array holds it high when it
detects that all the finished output wires are high at the same time. It would be
possible to use the network to communicate this synchronisation information. This
would reduce the number of Virtex pins required by the design. However, the
synchronisation logic would be more complex and require more circuitry. The
synchronisation process would also take longer to execute.
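
The dedicated-wire scheme amounts to a wide AND across the cells, as the following
sketch suggests. The module and port names are assumptions, and the sketch models
only the array-level detection, not the per-cell logic that drives each finished
output wire.

```verilog
// Hypothetical sketch of the array synchronisation logic: the shared
// "finished input" wire goes high only when every cell reports finished.
module Ary_sync_sketch (Finished_out, Finished_in);

parameter Num_cells = 64;  // 8 x 8 array assumed, as in the text's example

input  [Num_cells-1:0] Finished_out;  // one finished output wire per cell
output                 Finished_in;   // fed back to every cell

// Reduction AND: high only if all finished output wires are high together.
assign Finished_in = &Finished_out;

endmodule
```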


The information pertaining to the circuit description is stored in five memories within
an APPLES processor. Under the basic APPLES Verilog design these memories
are loaded from data files using the $READMEM system command. For the
system to be implemented on a Virtex chip these memories could be loaded via a
PCI interface.

Under the APPLES array each processor evaluates part of the circuit to be
simulated. The contents of these five memories need to be split among the
processors in the array. The memory contents also need to be processed in order
to make them compatible with the array design. Under an implementation using an
array of Virtex chips this data could be loaded via a PCI bus and distributed using
the array network. The data would be pre-processed for the array and each
processor would simply need to load the data into its memories. The incorporation
into the design of a system to distribute this data is non-trivial. This project is
mainly concerned with the analysis of the array design's ability to simulate circuits.
An analysis of the array's initialisation system is not of paramount importance at
this time. As a result the initialisation system was not designed.

In order to initialise the design, to facilitate simulating circuits, a Verilog task was
written to load the memories. The single processor circuit description files are
loaded into a global memory in the design. Each processor in the array is assigned
a number. A processor's number is calculated by multiplying its y co-ordinate by
the array width and adding its x co-ordinate. Each processor loads a segment of
the global Array 1a, Array 1b, the fan out header table and the fan out size table
into its local memory. These segments are of equal size. The segments chosen
are based on their processor number. Processor zero takes the first segment,
processor one takes the second segment and so on. A segment of the fan out
vector table must be loaded also. The segment is determined by looking at the
contents of the local fan out size and fan out header tables. The first address to be
loaded from the global fan out vector table is the address stored in the first location
in the local fan out header table. The last address to be loaded is calculated by
adding the address stored in the last entry in the local fan out header table to the
last fan out size stored in the final entry in the local fan out size table. The
addresses within the fan out header table must be adjusted to point at the new
local fan out vector table. This is achieved by subtracting the address stored in the
first location in the local fan out header table from each address in the same table.
Each gate input address stored in the local fan out vector table must be converted
into an array address. An array address consists of the destination processor's x
co-ordinate stored in bits fourteen to twelve, the destination processor's y
co-ordinate stored in bits eleven to nine and the gate input's local address on the
destination processor stored in bits eight to zero.
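
The processor numbering and the array address layout can be sketched as follows. The
module and port names are assumptions. The sketch also assumes that the nine-bit local
address occupies bits eight down to zero, which is what the x and y fields in bits
fourteen to nine leave available.

```verilog
// Hypothetical sketch: processor number and 15-bit array address.
module Ary_adr_sketch (X, Y, Local_adr, Proc_num, Ary_adr);

parameter Ary_wdth = 8;  // array width in cells (8 x 8 example from the text)

input  [2:0]  X, Y;        // destination processor co-ordinates
input  [8:0]  Local_adr;   // gate input address on the destination processor
output [5:0]  Proc_num;    // number used to select a memory segment
output [14:0] Ary_adr;     // address stored in the local fan out vector table

// Processor number = y * array width + x.
assign Proc_num = Y * Ary_wdth + X;

// x in bits 14..12, y in bits 11..9, local address in bits 8..0 (assumed).
assign Ary_adr = {X, Y, Local_adr};

endmodule
```

For example, a processor at x = 3, y = 2 in an array of width eight would be processor
number 19 under this scheme.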

Using this system the circuit description is split among the processors. No
consideration is given to decide which gate is simulated on which processor. The
APPLES circuit description files determine where each gate is simulated. The
layout of these files is determined by the layout of the ISCAS-85 net list files that
were used to generate the APPLES circuit description files.

Referring to Fig. 10, there is illustrated an alternative layout of the processor in which
parts similar to those described with reference to Fig. 1 are identified by the same
reference numerals. In this embodiment, the scan registers are identified by the
reference numerals 6a and the general logic sequence is identified by the reference
numeral 40. The processor will also include a circuit splitting logic circuit 41 and a
network interface 42. A fan out generator 43 is identified and will include, for
example, the fan out memory 8. The network interface 42 shares access to the
input value register bank 2.

The original APPLES design is written in Verilog. So is the array design. The
Verilog code is written at a behavioural level. This is the most abstract level
available to a Verilog programmer. As with any Verilog system it is split into Verilog
modules. Each module is a component of the system. The Verilog modules added
under the APPLES array design are:

• The Top Module
• The Array Module
• The Cell Module
• The Receiver Module
• The Transmitter Module
• The Request Scanner Module
• The Request Router Module
• The Buffer Module
• The Network Interface Module

The Top module is used to test that the system is performing correctly. An
instantiation of the Top module contains an instantiation of the Array module. The
Array contains multiple instantiations of the Cell module. Each Cell contains four
instantiations of both the Transmitter and Receiver modules. A Cell also contains a
Request Scanner, a Request Router, several Buffers and an APPLES processor.
The APPLES processor contains instantiations of the standard processor
components along with an instantiation of the Network Interface module. This
structure and the behaviour of these modules were described earlier in this
chapter. Each of these modules is contained within an appropriately named file.

In addition to designing these modules, the array design also required the following
changes:

• The introduction of a Verilog task to split the circuit description information
among the processors in the array. This is located in the APPLES processor
module.

• The incorporation of processor synchronisation logic into the APPLES
processor module, the Cell module and the Array module.

• The integration of the Network Interface module into the APPLES processor.

The APPLES architecture incorporates an alternative timing strategy which obviates
the need for complex deadlock avoidance or recovery procedures and other
mechanisms normally part of an event-driven simulation. The present invention has
an overhead which is considerably less than conventional approaches and permits
gate evaluation to be activated in memory. The reduction in processing overheads is
manifest in improved speedup performance relative to other techniques.

A message passing mechanism inherent in the Chandy-Misra algorithms has been
replaced by a parallel scanning mechanism. This mechanism allows the
fan-out/update procedure to be parallelised. As clashes occur, gates are effectively put
into a waiting queue which fills up a fan-out/update pipeline. Consequently, as the
pipeline fills up (with the increased number of scan registers), performance increases.
The speedup reaches a limit when the rate of new gates entering the queue equals the
fan-out rate. Nevertheless, the speedup and the number of cycles per gate processed are
considerably better than conventional approaches. The system also allows a wide
range of delay models.

The bit-pattern gate evaluation mechanism in APPLES facilitates the implementation
of simple and complex delay models as a series of parallel searches. Consequently,
the evaluation process is constant in time, being performed in memory. Effectively,
there is a one to one correspondence between gate and processor (the gate word
pairs). This fine grain parallelism allows maximum parallelism in the gate evaluation
phase. Active gates are automatically identified and their fan-out lists updated
through scanning a hit-list. This scanning mechanism is analogous to
communication overhead in typical parallel processing architectures; however, this
scanning is amenable to parallelisation itself. Multiple scan-registers reduce the
overhead time and enable the gate processing rate to be limited solely by the fan-out
memory bandwidth. The APPLES architecture thereby attains a substantial speedup
of the logic simulation, resulting in a gate processing rate of a few machine cycles
per gate.

In this specification, the terms "comprise", "comprises" and "comprising" are used
interchangeably with the terms "include", "includes" and "including", and are to be
afforded the widest possible interpretation and vice versa.

The invention is not limited to the embodiments hereinbefore described which may be
varied in both construction and detail within the scope of the claims.

CLAIMS

1. A parallel processing method of logic simulation comprising representing
signals on a line over a time period as a bit sequence, evaluating the output of
any logic gate including an evaluation of any inherent delay by a comparison
between the bit sequences of the inputs of the logic gate to a predetermined
series of bit patterns and in which those logic gates whose outputs have
changed over the time period are identified during the evaluation of the gate
outputs as real gate changes and only those real gate changes are propagated
to the fan out gates of those logic gates, characterised in that

the control of the method is carried out in an associative memory
mechanism (1a, 1b) which stores in word form a history of gate input
signals by compiling a hit list register of logic gate state changes and
using a multiple response resolver (7) forming part of the associative
memory mechanism (1a, 1b) which generates an address for each hit,
and then scans and transfers the results on the hit list to an output
register for subsequent use.

2. A method as claimed in claim 1, in which each delay is stored as a delay word
in an associative memory (1b) forming part of the associative memory
mechanism (1a, 1b) in which:-

the length of the delay word is ascertained; and

if the delay word width exceeds the associative register (1b) word
width:-

the number of integer multiples of the register word width contained
within the delay word is calculated as a gate state;

the gate state is stored in a further state register (1a);

the remainder from the calculation is stored in the associative register
(1b) with those delay words whose widths did not exceed the
associative register word width; and

on the count of the associative register commencing:-

the state register (1a) is consulted for the delay word entered in the
state register and the remainder is ignored for this count of the
associative register;

at the end of the count of the associative register, the state register is
updated; and

the count continues until the remainder represents the count still
required.

3. A method as claimed in claim 1 or 2 in which the hit list is segmented into a
plurality of separate smaller hit lists, each connected to a separate scan
register and in which each scan register is operated in parallel to transfer the
results to the output register.

4. A method as claimed in any of claims 1 to 3 in which the associative register
(1b) is divided into separate smaller associative sub-registers, one type of logic
gate being allocated to each associative sub-register, each of which
associative sub-registers has corresponding sub-registers connected thereto
whereby gate evaluations and tests are carried out in parallel on each
associative sub-register.

5. A method as claimed in any of claims 1 to 4 in which each line signal to a
target logic gate is stored as a plurality of bits each representing a delay of one
time period, the aggregate bits representing the delay between signal output to
and reception by the target logic gate and in which the inherent delay of each
logic gate is represented in the same manner.

6. A method as claimed in claim 4 or 5 in which each associative sub-register is
used to form a hit list connected to a corresponding separate scan register.

7. A method as claimed in any of claims 1 to 6 in which where the number of the
one type of logic gate exceeds a predetermined number more than one
sub-register is used.

8. A method as claimed in any of claims 3 to 7 in which the scan registers are
controlled by exception logic using an OR gate whereby the scan is terminated
for each register on the OR gate changing state thus indicating no further
matches.

9. A method as claimed in claim 8 in which the scan is carried out by sequential
counting through the hit list and the steps are performed of:

checking if the bit is set indicating a hit;

if a hit, determining the address effected by that hit;

storing the address;

clearing the bit in the hit list;

moving to the next position in the hit list; and

repeating the above steps until the hit list is cleared.

10. A method as claimed in any preceding claim, in which each line signal to a
target logic gate is stored as a plurality of bits each representing a delay of one
time period, the aggregate bits representing the delay between signal output to
and reception by the target logic gate.

11. A method as claimed in any preceding claim in which there is an initialisation
phase in which:

specified signal values are inputted;

unspecified signal values are set to unknown;

test templates are prepared defining the delay model for each logic
gate;

the input circuit is parsed to generate an equivalent circuit consisting of
2-input logic gates; and

the 2-input logic gates are then configured.

12. A method as claimed in any preceding claim in which a multi-valued logic is
applied and in which n bits are used to represent a signal value at any instance
in time with n being any arbitrarily chosen logic.

13. A method as claimed in claim 12 in which an 8-valued logic is used where 000
represents logic 0, 111 represents logic 1 and 001 to 110 represent arbitrarily
defined other signal states.

14. A method as claimed in claim 12 or 13 in which the sequence of values on a
logic gate is stored as a bit pattern forming a unique word in the associative
memory mechanism (1a, 1b).

15. A method as claimed in any preceding claim in which there is stored a record
of all values that a logic gate has acquired for the units of delay of the longest
delay in the circuit.

16. A parallel processor for logic event simulation (APPLES) comprising:-

a main processor (30);

an associative memory mechanism (1a, 1b) including a response
resolver (7);

characterised in that the associative memory mechanism (1a, 1b) comprises:-

a plurality of separate associative sub-registers each for
the storage in word form of a history of gate input signals
for a specified type of logic gate; and

a plurality of separate additional sub-registers associated with each
associative sub-register whereby gate evaluations and tests can be
carried out in parallel on each associative sub-register.

17. A processor as claimed in claim 16, in which the additional sub-registers
comprise an input sub-register, a mask sub-register and a scan sub-register.

18. A processor as claimed in claim 17, in which the scan sub-registers are
connected to an output register.